# Text Analysis With Python 

## Introduction - Lorin

To follow along...

Packages that need to be installed:

- pandas
- scikit-learn (imported as `sklearn`)
- nltk
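A quick way to confirm the environment is ready is to check that each package can be imported. The sketch below assumes the import names `pandas`, `sklearn`, and `nltk` (NLTK is used later for its stop word list); the helper name `missing_packages` is made up for this example. Note that scikit-learn is installed as `scikit-learn` but imported as `sklearn`.

```python
from importlib.util import find_spec

# Maps import name -> pip package name (they differ for scikit-learn).
REQUIRED = {"pandas": "pandas", "sklearn": "scikit-learn", "nltk": "nltk"}

def missing_packages(required=REQUIRED):
    """Return the pip names of packages that cannot be imported."""
    return [pip_name for import_name, pip_name in required.items()
            if find_spec(import_name) is None]

missing = missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```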


## Python Packages often used for Text Analysis - Rolando
- Directly related to Text Analysis
    - NLTK
    - spaCy
    - TextBlob
    - Gensim
    - Transformers

- Useful tools for text analysis
    - Pandas
    - Scikit-Learn
    - Matplotlib


## Mini-project 1: Word Frequencies (1:05) - Rolando

#### Data: [Jane Eyre - Charlotte Brontë](https://www.gutenberg.org/files/1260/1260-h/1260-h.htm)
#### Tools: NLTK, Scikit-learn, Pandas
#### Method: Simple N-grams (Document vs Chapters), Maybe TF-IDF

##### 1. Load in data

In [1]:
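import os

chapter_texts = []
chs_dir = './datasets/jane_eyre_bronte/'
# note: os.listdir returns files in arbitrary order, not chapter order
for ch in os.listdir(chs_dir):
    ch_path = os.path.join(chs_dir, ch)
    # assumes the chapter files are UTF-8 encoded
    with open(ch_path, 'r', encoding='utf-8') as f:
        f.readline()  # skip the first line of each file
        txt = f.readlines()  # remaining lines as a list of strings
        chapter_texts.append(txt)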
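`os.listdir` returns entries in arbitrary order, so `chapter_texts[0]` is not guaranteed to be Chapter 1. If the chapter files carry a number in their names, a natural sort restores book order. The file names below are hypothetical, since the actual naming inside `./datasets/jane_eyre_bronte/` is not shown here, and `chapter_number` is a helper invented for this sketch.

```python
import re

def chapter_number(filename):
    """Return the first integer found in a file name (or -1 if none),
    so 'chapter_10.txt' sorts after 'chapter_2.txt'."""
    match = re.search(r'\d+', filename)
    return int(match.group()) if match else -1

# Hypothetical file names, in an arbitrary order like os.listdir might return.
files = ['chapter_10.txt', 'chapter_2.txt', 'chapter_1.txt']
ordered = sorted(files, key=chapter_number)
print(ordered)
```

In the loading loop, `for ch in sorted(os.listdir(chs_dir), key=chapter_number):` would then iterate in chapter order.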

In [7]:
# the first ten lines of the first file loaded (not necessarily Chapter 1)
chapter_texts[0][:10]

['\n',
 '\n',
 'A splendid Midsummer shone over England: skies so pure, suns so radiant\n',
 'as were then seen in long succession, seldom favour even singly, our\n',
 'wave-girt land. It was as if a band of Italian days had come from the\n',
 'South, like a flock of glorious passenger birds, and lighted to rest\n',
 'them on the cliffs of Albion. The hay was all got in; the fields round\n',
 'Thornfield were green and shorn; the roads white and baked; the trees\n',
 'were in their dark prime; hedge and wood, full-leaved and deeply\n',
 'tinted, contrasted well with the sunny hue of the cleared meadows\n']

##### 2. Clean text

In [22]:
import re

# join the lines together first with nested list comprehensions
joined_chs = [''.join([line for line in ch if line != '\n']) for ch in chapter_texts]

# replace line breaks (\n) with spaces
no_bl_chs = [re.sub(r'\n', ' ', ch) for ch in joined_chs]

# remove punctuation
no_punct_chs = [re.sub(r'[:;.,_“”\']', '', ch) for ch in no_bl_chs]
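The three cleaning steps can also be wrapped in one reusable function. The sketch below is an alternative under stated assumptions, not the notebook's method: it strips every character in `string.punctuation` plus curly quotes, which is broader than the character class above, and it collapses runs of whitespace. The name `clean_chapter` is made up for this example.

```python
import re
import string

# ASCII punctuation plus the curly quotes common in Project Gutenberg texts.
PUNCT_TABLE = str.maketrans('', '', string.punctuation + '“”‘’')

def clean_chapter(lines):
    """Join a chapter's lines, drop punctuation, and normalise whitespace."""
    text = ' '.join(line.strip() for line in lines)
    text = text.translate(PUNCT_TABLE)
    return re.sub(r'\s+', ' ', text).strip()

sample = ['\n',
          'A splendid Midsummer shone over England: skies so pure,\n',
          'suns so radiant\n']
print(clean_chapter(sample))
```

Because `string.punctuation` includes the hyphen, `wave-girt` becomes `wavegirt` here; leave `-` out of the table if the vectorizer should split such words instead.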

##### 3. Vectorize text
Using scikit-learn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [35]:
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

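# requires the NLTK stop word list: nltk.download('stopwords')
# kept as a list (some scikit-learn versions reject a set for stop_words)
stop_words = stopwords.words('english')

# demonstrating some of the parameters within CountVectorizer
vc = CountVectorizer(lowercase=True,
                     stop_words=stop_words,
                     ngram_range=(1,1),  # single words only
                     max_df=0.9,         # drop terms found in more than 90% of chapters
                     min_df=0.1)         # drop terms found in fewer than 10% of chapters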

vectors = vc.fit_transform(no_punct_chs)
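To make `ngram_range` concrete: `(1,1)` counts single words, while `(2,2)` would count pairs of adjacent words. The following plain-Python illustration shows what an n-gram count is; it is not how `CountVectorizer` is implemented, it tokenizes on whitespace only, and `ngram_counts` is a name invented for this sketch.

```python
from collections import Counter

def ngram_counts(text, n=1):
    """Count n-grams (tuples of n adjacent lowercase tokens) in a string."""
    tokens = text.lower().split()
    # zip over n shifted copies of the token list to form the n-grams
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams)

text = "the hay was all got in the fields round Thornfield were green"
unigrams = ngram_counts(text, n=1)
bigrams = ngram_counts(text, n=2)

print(unigrams[('the',)])           # occurrences of the single word "the"
print(bigrams[('the', 'fields')])   # occurrences of the pair "the fields"
```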


In [36]:
print(vc.vocabulary_)



In [37]:
word_vect_df = pd.DataFrame(data=vectors.toarray(), columns=vc.get_feature_names_out())
word_vect_df.sample(10)

Unnamed: 0,abandoned,abbot,abhor,able,abode,abrupt,abruptly,absence,absent,absolute,...,yellow,yesterday,yield,yielded,yielding,yonder,young,younger,youth,zeal
28,0,0,0,0,0,0,0,1,1,0,...,0,3,0,0,0,1,1,0,0,0
14,0,0,0,0,0,0,0,0,0,0,...,2,1,0,0,0,0,2,1,0,0
36,1,0,0,0,1,0,0,0,0,0,...,0,3,0,0,0,2,1,0,0,0
24,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
20,0,0,0,2,0,0,0,0,0,0,...,0,0,0,0,3,0,6,0,0,0
34,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,2,2,0,1,0
31,0,0,0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,4,1,1,0
13,1,0,0,2,0,0,0,0,0,0,...,0,0,1,0,0,0,2,0,0,0
6,0,0,0,1,0,0,0,3,1,1,...,0,0,0,1,0,0,2,3,0,0
38,0,0,0,0,0,0,0,0,0,0,...,0,0,0,3,0,0,0,0,0,1


In [38]:
word_vect_df.sum(axis=0).sort_values(ascending=False).head(20)

rochester    366
jane         346
sir          316
miss         310
mrs          252
night        223
john         202
door         182
love         153
make         152
way          150
without      149
away         149
man          146
went         144
cannot       139
st           137
fairfax      137
head         136
back         135
dtype: int64
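The method list mentions TF-IDF, and `TfidfVectorizer` is imported above but not yet used. A minimal sketch on a made-up three-document corpus (the toy sentences are not taken from the novel) shows how a word shared by every document is weighted below a word unique to one document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for chapters; "jane" appears in every document.
docs = ["jane met rochester",
        "jane left thornfield",
        "jane read books"]

tfidf = TfidfVectorizer()   # default settings: lowercase, unigrams, L2-normalised rows
weights = tfidf.fit_transform(docs)

vocab = tfidf.vocabulary_   # maps each term to its column index
first_doc = weights.toarray()[0]
print("jane:", round(first_doc[vocab["jane"]], 3))
print("rochester:", round(first_doc[vocab["rochester"]], 3))
```

With the chapter data, the equivalent call would be `fit_transform(no_punct_chs)` on a `TfidfVectorizer` configured with the same parameters as the count vectorizer.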

## Mini-project 2: Classification (1:15) - Lorin

#### Data: [On the Books Laws](https://cdr.lib.unc.edu/concern/data_sets/v405sk89q?locale=en)
#### Tools: Scikit-learn, Pandas
#### Methods: Supervised (Jim Crow vs. Non-Jim Crow) vs Unsupervised (Topic-modeling)
#### Helpful: [Comparing classifiers](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)


Bring in the labeled training set.

In [38]:
import pandas as pd

df = pd.read_csv("datasets/otb_training_set.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1785 entries, 0 to 1784
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      1785 non-null   object
 1   source                  1785 non-null   object
 2   jim_crow                1785 non-null   int64 
 3   chapter_num             1785 non-null   int64 
 4   section_num             1785 non-null   int64 
 5   chapter_text            1785 non-null   object
 6   section_text            1785 non-null   object
 7   year                    1785 non-null   int64 
 8   type_private laws       1785 non-null   int64 
 9   type_public laws        1785 non-null   int64 
 10  type_public local laws  1785 non-null   int64 
 11  type_session laws       1785 non-null   int64 
dtypes: int64(8), object(4)
memory usage: 167.5+ KB


The training set includes 512 examples of Jim Crow laws and 1,273 examples of non-Jim Crow laws.

In [12]:
df.jim_crow.value_counts()

0    1273
1     512
Name: jim_crow, dtype: int64
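Because the classes are imbalanced, it helps to know the accuracy of a trivial model that always predicts the majority class; any classifier trained later should beat that number. The baseline is simple arithmetic on the counts above.

```python
# Class counts copied from df.jim_crow.value_counts()
counts = {0: 1273, 1: 512}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(f"total={total}, majority class={majority_class}, "
      f"baseline accuracy={baseline_accuracy:.3f}")
# total=1785, majority class=0, baseline accuracy=0.713
```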

Laws were labeled as Jim Crow or not Jim Crow according to scholarly works (Pauli Murray, Richard Paschal) and experts at UNC (William Sturkey, among others).

In [10]:
df.source.value_counts()

project experts    1673
paschal              74
murray               38
Name: source, dtype: int64

We need to pick a target for our classification, aka the "output".

In [31]:
target = df["jim_crow"]

What features do we want to train the models on? They will be our "inputs".

In [29]:
features = df.loc[:, "section_text" : "type_session laws"]
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1785 entries, 0 to 1784
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   section_text            1785 non-null   object
 1   year                    1785 non-null   int64 
 2   type_private laws       1785 non-null   uint8 
 3   type_public laws        1785 non-null   uint8 
 4   type_public local laws  1785 non-null   uint8 
 5   type_session laws       1785 non-null   uint8 
dtypes: int64(1), object(1), uint8(4)
memory usage: 35.0+ KB
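The selected features mix free text (`section_text`) with numeric columns, and scikit-learn estimators need numeric input, so the text column has to be vectorized before training. One possible approach, shown here on a made-up three-row frame with a subset of the columns, is a `ColumnTransformer` that applies `TfidfVectorizer` to the text and passes the other columns through. This is a sketch of one option, not necessarily the approach used in the workshop.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up rows that mimic part of the features frame.
toy = pd.DataFrame({
    "section_text": ["separate schools shall be provided",
                     "the county may levy a road tax",
                     "the board shall appoint a clerk"],
    "year": [1901, 1915, 1923],
    "type_public laws": [1, 0, 1],
})

# A string (not a list) column name hands TfidfVectorizer a 1-D column of text.
prep = ColumnTransformer(
    transformers=[("text", TfidfVectorizer(), "section_text")],
    remainder="passthrough",   # keep year and the type_* column unchanged
)

X = prep.fit_transform(toy)
n_terms = len(prep.named_transformers_["text"].vocabulary_)
print(X.shape, n_terms)  # columns = vocabulary size + 2 passthrough columns
```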


Use `train_test_split` to separate the data into training and testing sets: 80% for training and 20% for testing, as set by `test_size = 0.2`. Rows are assigned to each set at random, so fixing `random_state` lets everyone get the same split. `X_train` and `X_test` hold the inputs; `y_train` and `y_test` hold the outputs.

In [32]:
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2, random_state = 25)
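Two properties of `train_test_split` are easy to verify on toy data: the same `random_state` reproduces the same split, and the optional `stratify` argument (not used above) keeps the class ratio the same in both sets, which can matter with imbalanced labels like these. The arrays below are synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 30% positive labels.
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 30 + [0] * 70)

split_a = train_test_split(X, y, test_size=0.2, random_state=25)
split_b = train_test_split(X, y, test_size=0.2, random_state=25)

# Same random_state -> identical test rows on every run.
same_split = np.array_equal(split_a[1], split_b[1])
print(same_split)

# stratify=y keeps the 30/70 label ratio in the 20-row test set.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=25, stratify=y)
print(len(y_test_strat), int(y_test_strat.sum()))
```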

## Mini-project 3: Sentiment Analysis (1:35) - Rolando

#### Data: [ChatGPT Sentiment Analysis](https://www.kaggle.com/datasets/charunisa/chatgpt-sentiment-analysis)
#### Tools: TextBlob, NLTK, spaCy, Transformers
#### Methods: Dictionary (+ rule-based) vs. Transformer-based approach
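
To illustrate the dictionary (plus rule-based) side of that comparison, here is a toy scorer: it sums word polarities from a tiny hand-made lexicon and flips the sign of the next sentiment word after a negator. The lexicon values, the negator list, and the function name are all invented for illustration; libraries such as TextBlob rely on much larger lexicons and more rules.

```python
# Tiny hand-made lexicon (hypothetical values, for illustration only).
LEXICON = {"good": 1.0, "great": 2.0, "helpful": 1.0,
           "bad": -1.0, "terrible": -2.0, "useless": -1.5}
NEGATORS = {"not", "never", "no"}

def sentiment_score(text):
    """Sum lexicon polarities; a negator flips the next sentiment word."""
    score, negate = 0.0, False
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word in NEGATORS:
            negate = True
            continue
        polarity = LEXICON.get(word, 0.0)
        if polarity:
            score += -polarity if negate else polarity
            negate = False  # the negation has been consumed
    return score

print(sentiment_score("ChatGPT is great and helpful!"))  # 3.0
print(sentiment_score("The answer was not good."))       # -1.0
```

A transformer-based model replaces the lexicon and rules with a classifier learned from labeled text, which is the contrast this mini-project sets up.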