# Data Cleaning 

## Project 3 


### By: Patrick L. Cavins 

## Import the Collected Data

- Import the CSVs from the previous notebooks
- Remove some unnecessary columns
- Investigate the nulls
   - Remove? No, there are too many of them (362 in biology, 553 in chemistry)
   - Convert them?
- Lowercase all the text
- Combine my features to use a bag-of-words approach


In [2]:
# import libraries 
import pandas as pd 
from nltk import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np

In [3]:
# import our data frames from previous notebook 

biology_df = pd.read_csv('./data/biology_data_reddit.csv')

chemistry_df = pd.read_csv('./data/chemistry_data_reddit.csv')

In [4]:
# dropping some irrelevant columns 
biology_df.drop(columns=['Unnamed: 0'], inplace=True)
chemistry_df.drop(columns=['Unnamed: 0'], inplace=True)

In [5]:
#Checking the dtypes 
print (biology_df.info())
print (chemistry_df.info())

# Checking for Nulls in the dataframes 
print (biology_df.isnull().sum())
print (chemistry_df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1232 entries, 0 to 1231
Data columns (total 2 columns):
text     870 non-null object
title    1232 non-null object
dtypes: object(2)
memory usage: 19.3+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1252 entries, 0 to 1251
Data columns (total 2 columns):
text     699 non-null object
title    1252 non-null object
dtypes: object(2)
memory usage: 19.6+ KB
None
text     362
title      0
dtype: int64
text     553
title      0
dtype: int64


## Converting Components of the Data 

### Lowercase for the text 
- lambda function using .lower()

### Cleaning up the NaNs
- Numerous posts above have NaN values (a title with no text)
- Using a **stop_words** approach: replace each NaN with 'the' (essentially eliminating it, while still keeping the entry in the dataset)
- I want to use a bag-of-words approach, so combine the text and title into a single feature

### Generate y 
- Create a column called ```label``` that is either `chemistry` or `biology`, matching the source subreddit

### Join the DataFrames into a single DataFrame
- use pd.concat 
- binarize the ```label``` column ```{'chemistry': 0, 'biology':1}``` 

In [7]:
# lowercase all the text

biology_df = biology_df.apply(lambda col: col.str.lower()) # lowercase every string column

chemistry_df = chemistry_df.apply(lambda col: col.str.lower()) # lowercase every string column

# cleaning up the NaNs: fill missing post bodies with the stop word 'the'

biology_df['text'] = biology_df['text'].fillna('the')
chemistry_df['text'] = chemistry_df['text'].fillna('the')

# combining some features
# note: no separator is inserted, so the end of the text fuses with the start of the title (e.g. 'thewe')
biology_df['title_and_text'] = biology_df['text'] + biology_df['title']

chemistry_df['title_and_text']  = chemistry_df['text'] + chemistry_df['title']


# generate our y  
biology_df['label']  = 'biology'
chemistry_df['label'] = 'chemistry'

In [8]:
# we want to generate a stop word token? 
biology_df.head()

Unnamed: 0,text,title,title_and_text,label
0,the,we got /r/biochemistry back from the abyss! al...,thewe got /r/biochemistry back from the abyss!...,biology
1,not sure if there is a thread like this but..\...,best made products for women doing fieldwork?,not sure if there is a thread like this but..\...,biology
2,https://i.redd.it/ctnwgz9golp21.jpg\n\nrhinogr...,"rhinogradentia, a long believed order of extin...",https://i.redd.it/ctnwgz9golp21.jpg\n\nrhinogr...,biology
3,how is protein synthesis related to the expres...,"protein synthesis, any help would be extremely...",how is protein synthesis related to the expres...,biology
4,"hey reddit,\n\ni’ve been a lurker for a while,...",laboratory career transition. help!,"hey reddit,\n\ni’ve been a lurker for a while,...",biology


In [9]:
chemistry_df.head() 

Unnamed: 0,text,title,title_and_text,label
0,"**intro**\n \n hello everyone, welcome bac...",[2019/03/29] synthetic challenge #78,"**intro**\n \n hello everyone, welcome bac...",chemistry
1,this is a dedicated weekly thread for you to s...,weekly careers/education questions thread,this is a dedicated weekly thread for you to s...,chemistry
2,the,i was doing the rotameter calibration by a old...,thei was doing the rotameter calibration by a ...,chemistry
3,the,a traffic light reaction i had to make for my ...,thea traffic light reaction i had to make for ...,chemistry
4,the,"this ""black fire"" i made using a low pressure ...","thethis ""black fire"" i made using a low pressu...",chemistry
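The `thei was` and `thea traffic` values in the frames above show the placeholder fusing onto the title, because the two columns are added without a separator. A minimal sketch of one alternative, assuming the goal is a clean bag of words: treat a missing body as empty and join with a space. The helper name `combine_title_and_text` and the toy posts are hypothetical, not part of the notebook.

```python
import math

def combine_title_and_text(title, text, sep=" "):
    """Join title and body with a separator, treating a missing body as empty."""
    # pandas represents a missing value as float('nan'); None is handled as well
    missing = text is None or (isinstance(text, float) and math.isnan(text))
    body = "" if missing else text
    # strip() removes the dangling separator when the body is empty
    return (title + sep + body).strip().lower()

posts = [
    ("Protein synthesis, any help?", "How is protein synthesis related to expression?"),
    ("A traffic light reaction", float("nan")),  # title-only post
]
combined = [combine_title_and_text(title, text) for title, text in posts]
print(combined)
```

With pandas, the same helper could be applied row-wise, for example `biology_df.apply(lambda row: combine_title_and_text(row['title'], row['text']), axis=1)`. Doing so would change the vocabulary and the counts recorded in the cells that follow.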


In [10]:
## We need to join the dataframes 
join_df = pd.concat([chemistry_df, biology_df], axis=0)

In [11]:
## We need to binarize the label 
join_df['label'] = join_df['label'].map({'chemistry': 0, 'biology':1})

join_df.head()

Unnamed: 0,text,title,title_and_text,label
0,"**intro**\n \n hello everyone, welcome bac...",[2019/03/29] synthetic challenge #78,"**intro**\n \n hello everyone, welcome bac...",0
1,this is a dedicated weekly thread for you to s...,weekly careers/education questions thread,this is a dedicated weekly thread for you to s...,0
2,the,i was doing the rotameter calibration by a old...,thei was doing the rotameter calibration by a ...,0
3,the,a traffic light reaction i had to make for my ...,thea traffic light reaction i had to make for ...,0
4,the,"this ""black fire"" i made using a low pressure ...","thethis ""black fire"" i made using a low pressu...",0


In [12]:
#export to csv 

join_df.to_csv('./data/join_df_biology_chemistry.csv', index=False)

## Vectorize! 

### CountVectorizer 
- stop_words = 'english'
- max_features= 700  
- token_pattern = "[A-z]{2,}[\d]*"
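One detail of the `token_pattern` listed above is worth checking: in ASCII, the range `A-z` also covers `[`, `\`, `]`, `^`, `_` and the backtick, which sit between `Z` and `a`. The sketch below compares it with `[A-Za-z]` using Python's `re` module on a made-up sentence. That `[A-Za-z]` is the intended character class is an assumption.

```python
import re

# made-up sample text, already lowercased as in this notebook
sample = "see [https://example.org] for 2 cells_and dna2 notes"

loose = re.compile(r"[A-z]{2,}[\d]*")      # pattern used in this notebook
strict = re.compile(r"[A-Za-z]{2,}[\d]*")  # letters only (assumed intent)

loose_tokens = loose.findall(sample)
strict_tokens = strict.findall(sample)

print(loose_tokens)
print(strict_tokens)
```

Tokens such as `[https` and `org]` survive under the notebook's pattern, which would explain features like `[day` and `\]` in the vectorized output.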



In [13]:
# Instantiate the vectorizer

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


cvec = CountVectorizer(stop_words = 'english', # drop common English stop words
                       max_features= 700,  # keep only the 700 most frequent tokens
                       token_pattern= r"[A-z]{2,}[\d]*") # 2+ letters, then optional digits; note [A-z] also matches [ \ ] ^ _ and `

# tvec = TfidfVectorizer()

In [14]:
# cvec.fit_transform(biology_df['title_and_text']).toarray()

In [15]:
biology_cvec_df = pd.DataFrame(cvec.fit_transform(biology_df['title_and_text']).toarray(), #vectorize
                               columns=cvec.get_feature_names_out()) # get_feature_names() was removed in scikit-learn 1.2

In [16]:
biology_cvec_df.head()

Unnamed: 0,[day,[https,\],able,academic,activated,activity,actual,actually,add,...,worried,worth,wouldn,write,writing,wrong,www,year,years,youtube
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,2,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
biology_cvec_df.sum()

[day                14
[https              49
\]                  12
able                88
academic            14
activated           12
activity            15
actual              18
actually            45
add                 17
adrenal             14
adult               17
advance             35
advice              74
affect              15
ago                 21
ahukewjnodvz763     12
air                 22
amazing             12
amp                355
amygdala            12
analysis            15
animal              56
animals             57
answer              38
answers             25
antibiotics         15
antibodies          15
anybody             17
appreciate          27
                  ... 
waders              12
wall                14
want               155
wanted              28
wants               12
watch               28
water               52
way                 83
ways                16
web                 12
week                21
went                12
white      

In [18]:
chemistry_cvec_df = pd.DataFrame(cvec.fit_transform(chemistry_df['title_and_text']).toarray(),
                                 columns=cvec.get_feature_names_out())

In [19]:
chemistry_cvec_df.head()

Unnamed: 0,[fe,[structure,\[31,\\\__o__,\],a],able,acceptable,access,achieve,...,world,ww5,www,yam,yard,yea,yhy5,zeolite,zinc,zooming
0,0,4,0,0,0,1,0,0,0,0,...,0,1,0,0,0,1,1,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
chem_list = chemistry_cvec_df.columns.values.tolist()

In [21]:
bio_list = biology_cvec_df.columns.values.tolist()

In [22]:
chemistry_cvec_df.sum().sort_values(ascending=False)

amp            491
oxidation      343
copper         295
water          257
like           253
using          249
does           249
chemistry      247
air            245
hot            245
ask            245
nickel         245
thanks         200
acid           200
way            198
thought        197
grade          196
ions           196
reagent        196
blood          196
know           156
test           154
help           152
want           152
ve             149
research       149
use            149
analysis       148
jdx            147
file           147
              ... 
soup             1
soot             1
iic              1
includes         1
hexane           1
incorporate      1
increasing       1
soluble          1
soapy            1
inform           1
smoke            1
hey              1
speed            1
fine             1
generate         1
flame            1
submissions      1
submission       1
following        1
form             1
pinot            1
stuck       

In [23]:
biology_cvec_df.sum().sort_values(ascending=False)

biology      397
like         360
amp          355
know         304
just         274
https        202
does         187
dna          185
work         185
ve           164
really       164
research     161
don          160
want         155
cell         154
cells        150
help         149
need         145
com          136
question     134
time         129
lab          119
year         118
people       115
school       113
think        112
www          109
good         104
different    102
thanks        97
            ... 
drinking      12
double        12
direct        12
developed     12
develop       12
close         12
imgrc         12
prevent       12
march         12
practice      12
original      12
bih           12
notes         12
nad           12
biw           12
million       12
medicine      12
master        12
male          12
important     12
main          12
lnms          12
known         12
isch          12
involved      12
intro         12
insects       12
check         

In [24]:
both_list = []

for word in chem_list:
    if word in bio_list:
        both_list.append(word)

In [25]:
len(both_list)

199

In [26]:
print (both_list)

['\\]', 'able', 'advance', 'advice', 'affect', 'air', 'amp', 'analysis', 'answer', 'appreciate', 'appreciated', 'ask', 'asking', 'background', 'bad', 'better', 'bio', 'bit', 'blood', 'called', 'came', 'careers', 'cause', 'cell', 'change', 'chemical', 'chemicals', 'chemistry', 'cold', 'com', 'complete', 'complex', 'computational', 'correct', 'create', 'data', 'determine', 'did', 'didn', 'different', 'does', 'doing', 'don', 'easy', 'education', 'effect', 'environmental', 'experiments', 'expert', 'explain', 'far', 'feel', 'fine', 'following', 'food', 'form', 'free', 'future', 'good', 'grade', 'great', 'guys', 'happen', 'haven', 'heard', 'hello', 'help', 'hey', 'hi', 'high', 'higher', 'homework', 'host', 'hot', 'https', 'idea', 'important', 'instead', 'interesting', 'intro', 'just', 'kind', 'know', 'lab', 'laboratory', 'learn', 'left', 'let', 'level', 'light', 'like', 'liquid', 'll', 'long', 'look', 'looked', 'looking', 'looks', 'lot', 'love', 'main', 'make', 'makes', 'making', 'maybe', 'm

In [27]:
len(chem_list)

700

In [28]:
len(bio_list)

700
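The overlap computed above (199 of 700 features, about 28%) can also be expressed with set operations, which avoid the nested membership test and make the subreddit-specific vocabulary easy to pull out. A small sketch on toy lists; `chem_vocab` and `bio_vocab` are hypothetical stand-ins for `chem_list` and `bio_list`.

```python
# toy stand-ins for chem_list and bio_list
chem_vocab = ["acid", "amp", "cell", "copper", "water"]
bio_vocab = ["amp", "cell", "dna", "water"]

# words present in both vocabularies
shared = sorted(set(chem_vocab) & set(bio_vocab))

# words that appear only in the chemistry vocabulary
chem_only = sorted(set(chem_vocab) - set(bio_vocab))

# fraction of the chemistry vocabulary that is shared
overlap_share = len(shared) / len(chem_vocab)

print(shared, chem_only, overlap_share)
```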

### TF-IDF
- stop_words = 'english'
- max_features= 700  
- token_pattern = "[A-z]{2,}[\d]*"
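For intuition about how these weights differ from raw counts, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is what scikit-learn documents for `TfidfVectorizer` with default settings. The toy documents and the helper `tfidf` are hypothetical.

```python
import math
from collections import Counter

# toy corpus, already tokenized
docs = [
    ["dna", "cell", "cell"],
    ["dna", "acid"],
    ["acid", "copper", "acid"],
]

n_docs = len(docs)
# document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))
# smoothed idf: terms found in fewer documents get a larger weight
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf(doc):
    """Return L2-normalised tf-idf weights for one tokenized document."""
    counts = Counter(doc)
    raw = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

weights = [tfidf(doc) for doc in docs]
print(weights[0])
```

Because each row is scaled to unit length, the column sums are fractional rather than integer counts.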

In [29]:
tvec = TfidfVectorizer(stop_words='english',
                       max_features=700,
                       token_pattern = r'[A-z]{2,}[\d]*')

In [30]:
biology_tvec_df = pd.DataFrame(tvec.fit_transform(biology_df['title_and_text']).toarray(), 
                               columns=tvec.get_feature_names_out())

In [31]:
biology_tvec_df

Unnamed: 0,[day,[https,\],able,academic,activated,activity,actual,actually,add,...,worried,worth,wouldn,write,writing,wrong,www,year,years,youtube
0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
1,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.077941,0.0,...,0.0,0.000000,0.090404,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
2,0.0,0.114901,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.215866,0.000000,0.000000,0.0
3,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
4,0.0,0.000000,0.0,0.095466,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
5,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
6,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.000000,0.137718,0.000000,0.000000,0.000000,0.0
7,0.0,0.000000,0.0,0.478789,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
8,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
9,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.198856,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0


In [32]:
biology_tvec_df.sum().sort_values(ascending=False)

biology            46.565779
like               34.332796
know               32.197001
does               28.857895
just               26.253737
amp                25.053890
cells              23.184986
dna                23.060574
https              22.351600
cell               21.378055
help               20.842399
question           20.024099
work               19.015415
need               18.217570
research           17.950180
life               17.682992
ve                 17.208139
com                17.148382
really             17.039880
human              16.997230
body               16.693012
bacteria           16.627369
want               16.588321
good               15.911093
thewhat            15.683699
don                15.539648
time               14.705339
wondering          14.695566
thethe             14.550456
different          14.249548
                     ...    
option              1.548713
master              1.535739
waders              1.456731
materials     

In [33]:
chemistry_tvec_df = pd.DataFrame(tvec.fit_transform(chemistry_df['title_and_text']).toarray(),
                                columns=tvec.get_feature_names_out())

In [34]:
chemistry_tvec_df

Unnamed: 0,[fe,[structure,\[31,\\\__o__,\],a],able,acceptable,access,achieve,...,world,ww5,www,yam,yard,yea,yhy5,zeolite,zinc,zooming
0,0.000000,0.212214,0.000000,0.000000,0.000000,0.053053,0.000000,0.000000,0.00000,0.000000,...,0.0,0.053053,0.000000,0.000000,0.000000,0.053053,0.053053,0.000000,0.000000,0.000000
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
5,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
6,0.000000,0.000000,0.000000,0.556967,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
7,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
8,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
9,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.000000,0.000000,0.190632,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


In [35]:
chemistry_tvec_df.sum().sort_values(ascending=False)

[fe              52.965984
crystal          52.965984
iodine           42.670627
amp              40.183764
thewondered      39.481448
help             38.686142
experiments      35.789960
using            34.116103
acid             33.557678
themore          31.595922
thek3            31.595922
hot              31.166886
cell             31.063096
concentration    30.989178
thought          30.730184
ions             30.082469
seaweed          29.806425
themade          29.806425
cool             28.790106
mg               28.290163
nearest          28.290163
theexactly       28.290163
nitrate          27.660169
nitrite          27.660169
chemistry        27.543297
ask              26.699601
test             26.132949
water            26.054595
tubes            25.854596
blood            25.150728
                   ...    
complete          0.053053
long              0.053053
novel             0.053053
submissions       0.053053
ujy7              0.053053
words             0.053053
s

## Some Baseline Model Examination 

In [36]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


In [37]:
# baseline accuracy: the majority-class share is the score to beat

join_df['label'].value_counts(normalize = True)

0    0.504026
1    0.495974
Name: label, dtype: float64
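The proportions above follow directly from the row counts reported by `.info()` earlier (1252 chemistry posts, 1232 biology posts). A short sketch of that arithmetic, with the majority-class share taken as the baseline accuracy:

```python
n_chemistry = 1252  # rows in chemistry_df (label 0)
n_biology = 1232    # rows in biology_df (label 1)
total = n_chemistry + n_biology

# share of each class in the joined frame
class_share = {0: n_chemistry / total, 1: n_biology / total}

# always predicting the majority class scores this accuracy
baseline_accuracy = max(class_share.values())

print(class_share, baseline_accuracy)
```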

In [38]:
# Split the data 

X = join_df['title_and_text']
y = join_df['label']


#Train Test Split 

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state = 42,
                                                    stratify  = y) # stratify keeps the class proportions the same in train and test

In [39]:
cvec = CountVectorizer(stop_words='english')

In [40]:
# cvec.fit_transform(X_train).toarray()

In [41]:
df_cvec = pd.DataFrame(cvec.fit_transform(X_train).toarray(), columns=cvec.get_feature_names_out())

df_cvec.head()

Unnamed: 0,000,00404,01,019,02,03,03c7r9d3mvn21,04,05,06,...,zum,zurück,zusammenhänge,zz,zz_u1tc,ölverseuchten,𝗛𝗮𝗿𝗮𝗹𝗱𝘀𝗲𝗻,𝗢𝗻𝗱𝗿𝗮𝗰𝗸𝗮,𝗦𝘂𝗴𝗮,𝗧𝗼𝘆𝗼𝗱𝗮
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [42]:
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words='english')), # this is the flow of the data through the model
    ('lgR', LogisticRegression(solver='liblinear'))
]) 

In [43]:
cross_val_score(pipe, X_train, y_train, cv=5).mean()

0.9892703583532659

In [44]:
# Fit your model
pipe.fit(X_train, y_train)

# Training score
print (pipe.score(X_train, y_train))

# Test score
print (pipe.score(X_test, y_test))

0.9994632313472893
0.9871175523349437


In [45]:
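# note: an empty token_pattern only ever matches empty strings, so every document
# collapses to a single '' token (see the one-column frame below);
# omit token_pattern to fall back to the default tokenizer
tvec = TfidfVectorizer(stop_words='english',
                       token_pattern= '')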

In [46]:
# tvec.fit_transform(X_train).toarray()

In [47]:
df_tvec  = pd.DataFrame(tvec.fit_transform(X_train).toarray(),
                   columns=tvec.get_feature_names_out())
df_tvec.head()

Unnamed: 0,Unnamed: 1
0,1.0
1,1.0
2,1.0
3,1.0
4,1.0


In [48]:
pipe2 = Pipeline([
    ('tvec', TfidfVectorizer()), # this is the flow of the data through the model
    ('lgR', LogisticRegression(solver='liblinear'))
]) 

In [49]:
cross_val_score(pipe2, X_train, y_train, cv=5).mean()

0.9903427444122471

In [50]:
# Fit your model
pipe2.fit(X_train, y_train)

# Training score
print (pipe2.score(X_train, y_train))

# Test score
print (pipe2.score(X_test, y_test))

0.9903381642512077
0.9855072463768116
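The accuracy scores above are easier to compare as error counts. This sketch assumes `train_test_split` used its default `test_size` of 0.25 (no size is passed in the split cell), so the 2484 posts divide into 1863 training rows and 621 test rows. The helper `errors` is hypothetical.

```python
import math

n_rows = 1252 + 1232               # chemistry + biology posts
n_test = math.ceil(0.25 * n_rows)  # assumed default test_size=0.25
n_train = n_rows - n_test

def errors(accuracy, n):
    """Convert an accuracy score back into a count of misclassified rows."""
    return round((1 - accuracy) * n)

# scores printed by the two pipelines above
cvec_errors = (errors(0.9994632313472893, n_train), errors(0.9871175523349437, n_test))
tvec_errors = (errors(0.9903381642512077, n_train), errors(0.9855072463768116, n_test))

print(n_train, n_test, cvec_errors, tvec_errors)
```

On the held-out rows the two pipelines differ by a single post, so the gap between them is within noise for a test set of this size.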


In [51]:
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words='english')), # this is the flow of the data through the model
    ('lgR', LogisticRegression(solver='liblinear'))
]) 

In [52]:
# params for a grid search

params = { 'cvec__stop_words': [None, 'english'], # None and 'english' are the options to tune over
         'cvec__max_features': [np.linspace(1_000, 3_000)]} # bug: the whole array is passed as one candidate; max_features must be a single int

cvec = CountVectorizer()
gs_cvec = GridSearchCV(pipe, param_grid=params, verbose=1)
gs_cvec.fit(X_train, y_train)

Fitting 3 folds for each of 2 candidates, totalling 6 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
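The log line "2 candidates" hints at the cause: wrapping `np.linspace(1_000, 3_000)` in a list makes the entire array (50 floats by default) a single value for `max_features`, which expects one integer or `None`. The sketch below mimics how a parameter grid expands into candidates using `itertools.product`. `expand_grid` is a hypothetical helper for illustration, not scikit-learn's implementation, and the integer candidates are assumed values.

```python
from itertools import product

def expand_grid(grid):
    """Expand a dict of option lists into one dict per combination."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

# short stand-in for the array returned by np.linspace
array_like = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]

wrapped = {
    "cvec__stop_words": [None, "english"],
    "cvec__max_features": [array_like],        # one element: the whole array
}
flat = {
    "cvec__stop_words": [None, "english"],
    "cvec__max_features": [1000, 2000, 3000],  # three integer candidates
}

wrapped_candidates = expand_grid(wrapped)
flat_candidates = expand_grid(flat)

print(len(wrapped_candidates), len(flat_candidates))
```

With a flat list of integers, the search sees six candidates, each carrying a valid `max_features`.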

In [None]:
#this is the control flow 
pipe = Pipeline([
    ('cvec', CountVectorizer()), # this is the flow of the data through the model
    ('lgR', LogisticRegression(solver='liblinear'))
]) 

In [None]:
# Tune GridSearchCV

params = { 'cvec__stop_words': [None, 'english'], # None and 'english' are the options to tune over
         'cvec__max_features': [1_000, 2_000, 3_000]} # keep only the top N most frequent words in the corpus

gs = GridSearchCV(pipe, param_grid=params, cv=3)
gs.fit(X_train, y_train)

print (gs.best_score_) # mean cross-validated score of the best candidate
gs.best_params_
