## Add the Target Variable
Now that the data from r/Democrats and r/Conservative have been acquired using the API, the target variable must be added.  The study is not trying to predict one particular group, so the assignment of values to the target variable is arbitrary.  In this case r/Democrats will be assigned 0 and r/Conservative will be assigned 1.

In [1]:
import pandas as pd

In [2]:
dems = pd.read_csv('./data/democratic.csv')
reps = pd.read_csv('./data/republican.csv')
swar = pd.read_csv('./data/starwars.csv')
strek = pd.read_csv('./data/startrek.csv')
books = pd.read_csv('./data/books.csv')
tv = pd.read_csv('./data/tv.csv')

In [3]:
dems.head(1)

Unnamed: 0,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,...,thumbnail_height,thumbnail_width,title,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,,,False,backpackwayne,,,,[],,,...,,,THE TIME FOR UNITY IS NOW - A Progressive and ...,80,https://www.reddit.com/r/democrats/comments/95...,[],,False,all_ads,6


In [4]:
dems.shape


(876, 97)

In [5]:
len(dems.title)

876

In [6]:
len(set(dems.title))  #seem to be 3 duplicate titles

873

Remove any duplicate titles that might have gotten into the data.

In [7]:
dems.drop_duplicates(subset  = 'title', inplace = True)
reps.drop_duplicates(subset  = 'title', inplace = True)
swar.drop_duplicates(subset  = 'title', inplace = True)
strek.drop_duplicates(subset = 'title', inplace = True)
books.drop_duplicates(subset = 'title', inplace = True)
tv.drop_duplicates(subset    = 'title', inplace = True)

In [8]:
len(dems), len(reps), len(swar), len(strek), len(books), len(tv)

(873, 788, 675, 676, 420, 689)

After removing duplicates we have 873 Democratic posts, 788 Republican posts, 675 Star Wars posts, 676 Star Trek posts, 420 book posts and 689 TV posts.

In [9]:
reps.head(3)

Unnamed: 0,approved_at_utc,approved_by,archived,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,...,thumbnail_height,thumbnail_width,title,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,,,False,Yosoff,,Libertarian,"[{'e': 'text', 't': 'First Principles'}]",771ee808-8b38-11e1-8706-12313d096aae,First Principles,dark,...,,,U.S. Constitution Discussion - Week 10 of 52 (...,31,https://www.reddit.com/r/Conservative/comments...,[],,False,all_ads,6
1,,,False,thatrightwinger,,Far-Right,"[{'e': 'text', 't': ""Don't Tread on Me""}]",9b86186a-8b38-11e1-8f58-12313d2c1af1,Don't Tread on Me,dark,...,81.0,140.0,Burt Reynolds: Leading Man &amp; Distillation ...,18,https://www.nationalreview.com/2018/03/burt-re...,[],,False,all_ads,6
2,,,False,chabanais,,Bold,"[{'e': 'text', 't': 'Stronger than derp'}]",,Stronger than derp,dark,...,105.0,140.0,Nike's best trick...,1997,https://i.redd.it/94w2w7ioumk11.jpg,[],,False,all_ads,6


In [10]:
reps.shape

(788, 96)

In [11]:
len(set(reps.title))

788

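The two political data sets have nearly the same variables: r/Democrats has 97 columns and r/Conservative has 96 (the extra column appears to be `author_cakeday`).  Before deduplication r/Democrats had 876 entries and r/Conservative had 799 entries.  There were 11 duplicate titles for Republicans and 3 for Democrats, leaving 788 and 873 posts.  It is not yet clear whether these were duplicate posts or different people posting the same title.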
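One way to look into that open question is to list every row that shares a title and count the distinct authors behind it. This is a minimal sketch on a hypothetical toy frame (the `posts` frame and its values are made up for illustration; only the `title` and `author` column names come from the Reddit data):

```python
import pandas as pd

# Toy stand-in for one subreddit's posts; the real frames have ~97 columns.
posts = pd.DataFrame({
    "title":  ["Vote today", "Vote today", "Town hall recap", "Vote today"],
    "author": ["alice", "alice", "bob", "carol"],
})

# keep=False flags every row whose title occurs more than once.
dupes = posts[posts.duplicated(subset="title", keep=False)]

# Distinct authors per repeated title: 1 suggests a repost,
# more than 1 suggests different people used the same title.
authors_per_title = dupes.groupby("title")["author"].nunique()

# Same call as the notebook: keep the first row for each title.
deduped = posts.drop_duplicates(subset="title")

print(authors_per_title)
print(len(posts), "->", len(deduped))
```

If most repeated titles have a single author, dropping on `title` alone is probably safe; otherwise a stricter subset such as `['title', 'author']` may be worth considering.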

### Adding target value

In [12]:
dems['target'] = 0
reps['target'] = 1
dems['target'].head()  #democrats 0

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64

In [13]:
strek['target'] = 0
swar['target'] = 1  #startrek 0 starwars 1

In [14]:
books['target'] = 0
tv['target'] = 1


### Merging the dataframes

Each data frame now carries a target column (the r/Democrats frame has 98 columns and the r/Conservative frame has 97), so each pair can be merged together for analysis.
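The column count after merging can differ from that of either input, because `pd.concat` takes the union of the columns and `reset_index` adds one more. A small sketch with hypothetical miniature frames (`a` and `b` are illustrative stand-ins, not the notebook's data):

```python
import pandas as pd

# `a` has one column that `b` lacks, mirroring a column that
# appears in only one of the two subreddit pulls.
a = pd.DataFrame({"title": ["x", "y"], "author_cakeday": [True, None], "target": 0})
b = pd.DataFrame({"title": ["z"], "target": 1})

combined = pd.concat([a, b], axis=0)       # columns are the union; gaps become NaN
combined_reset = combined.reset_index()    # old row labels are kept as an 'index' column
clean = combined.reset_index(drop=True)    # drop=True discards the old labels instead

print(combined.shape, combined_reset.shape, clean.shape)
```

Whether to keep the old labels is a design choice: `drop=True` gives a clean 0..n-1 index without the extra column, while the default keeps each row's position in its original frame.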

In [15]:
frames = [dems, reps]
comb = pd.concat(frames, axis = 0)

In [16]:
frames_1 = [strek, swar]
comb_space = pd.concat(frames_1, axis = 0)


In [17]:
frames_2 = [books, tv]
comb_entertainment = pd.concat(frames_2, axis = 0)

In [18]:
comb.reset_index(inplace = True)
comb_space.reset_index(inplace = True)
comb_entertainment.reset_index(inplace = True)

In [19]:
comb.shape

(1661, 99)

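It looks as if the data frames combined correctly.  The combined frame has 99 columns (the 98 columns from the union of the two frames plus the `index` column added by `reset_index`) and 1661 cases, which is 873 + 788.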

Next the excess variables will be stripped and the DataFrame will be saved to a csv to be read into an analysis notebook.  The vectorization will be done with the analysis, since the parameters of the vectorization will be altered to maximize both the prediction and the explanatory power of the model.
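Columns that contain no data at all are an obvious first candidate for stripping. The following is a minimal sketch on a toy frame (the values are invented; the column names are borrowed from the Reddit data) showing how `dropna(axis=1, how="all")` treats fully and partly missing columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "title":       ["a", "b", "c"],
    "approved_by": [np.nan, np.nan, np.nan],   # entirely missing: dropped
    "selftext":    ["body", np.nan, np.nan],   # partly missing: kept
})

# List the columns that are about to be removed before removing them.
all_missing = df.columns[df.isna().all()].tolist()
trimmed = df.dropna(axis=1, how="all")

print(all_missing)
print(trimmed.columns.tolist())
```

Listing `all_missing` first makes it easy to check how many columns a given frame will lose.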

In [20]:
comb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1661 entries, 0 to 1660
Data columns (total 99 columns):
index                            1661 non-null int64
approved_at_utc                  0 non-null float64
approved_by                      0 non-null float64
archived                         1661 non-null bool
author                           1661 non-null object
author_cakeday                   1 non-null object
author_flair_background_color    0 non-null float64
author_flair_css_class           241 non-null object
author_flair_richtext            1658 non-null object
author_flair_template_id         155 non-null object
author_flair_text                499 non-null object
author_flair_text_color          502 non-null object
author_flair_type                1658 non-null object
author_fullname                  1658 non-null object
banned_at_utc                    0 non-null float64
banned_by                        0 non-null float64
can_gild                         1661 non-null bo

In [21]:
comb_space.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1351 entries, 0 to 1350
Data columns (total 99 columns):
index                            1351 non-null int64
approved_at_utc                  0 non-null float64
approved_by                      0 non-null float64
archived                         1351 non-null bool
author                           1351 non-null object
author_cakeday                   3 non-null object
author_flair_background_color    0 non-null float64
author_flair_css_class           119 non-null object
author_flair_richtext            1344 non-null object
author_flair_template_id         52 non-null object
author_flair_text                7 non-null object
author_flair_text_color          132 non-null object
author_flair_type                1344 non-null object
author_fullname                  1344 non-null object
banned_at_utc                    0 non-null float64
banned_by                        0 non-null float64
can_gild                         1351 non-null bool


The Star Wars v. Star Trek data definitely has more self text responses, which might help in the analysis.

In [22]:
# drop columns where all of the data is missing; this will eliminate 18 columns from the dataframe
comb = comb.dropna(axis = 1, how = "all")
# some steps like this were done at first to figure out which variables were important;
# in the end selftext and title were the only columns with meaningful text for NLP

The only real text seems to be in the title and selftext features.  These features will be combined to create one vector of text data for the count vectorization process.
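One detail to keep in mind when combining the two columns: plain `+` concatenation inserts no separator, so the last word of the title and the first word of the body can fuse into a single token. A sketch on a hypothetical two-row frame, assuming a single space is an acceptable separator:

```python
import pandas as pd

df = pd.DataFrame({
    "title":    ["Best TNG episode", "Link post"],
    "selftext": ["Darmok is great", None],   # link posts have no body text
})

body = df["selftext"].fillna("")

fused = df["title"] + body                        # no separator between the parts
spaced = (df["title"] + " " + body).str.strip()   # space keeps the tokens apart

print(fused[0])
print(spaced[0])
```

The `.str.strip()` call only removes the trailing space left behind when the body is empty.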

In [23]:
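# assign the result back; fillna(inplace=True) on a selected column is unreliable in recent pandas
comb['selftext'] = comb['selftext'].fillna("")
# title and selftext are concatenated directly, with no separator between them
comb['text'] = comb['title'] + comb['selftext']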



In [24]:
comb_space['selftext'] = comb_space['selftext'].fillna("")
comb_space['text'] = comb_space['title'] + comb_space['selftext']

In [25]:
comb_entertainment['selftext'] = comb_entertainment['selftext'].fillna("")
comb_entertainment['text'] = comb_entertainment['title'] + comb_entertainment['selftext']

In [26]:
comb.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1661 entries, 0 to 1660
Data columns (total 83 columns):
index                       1661 non-null int64
archived                    1661 non-null bool
author                      1661 non-null object
author_cakeday              1 non-null object
author_flair_css_class      241 non-null object
author_flair_richtext       1658 non-null object
author_flair_template_id    155 non-null object
author_flair_text           499 non-null object
author_flair_text_color     502 non-null object
author_flair_type           1658 non-null object
author_fullname             1658 non-null object
can_gild                    1661 non-null bool
can_mod_post                1661 non-null bool
clicked                     1661 non-null bool
contest_mode                1661 non-null bool
created                     1661 non-null float64
created_utc                 1661 non-null float64
crosspost_parent            100 non-null object
crosspost_parent_list       

### Search for Features to be included in analysis

In [27]:
comb.text.head(20)

0     THE TIME FOR UNITY IS NOW - A Progressive and ...
1     Nation Stunned That There Is Someone in White ...
2              Alex Jones Permanently Banned By Twitter
3     It's About Time Senate Democrats Showed Some D...
4                The Russians Hacked Kavanaugh’s Emails
5     Republicans keep trying to strip protections f...
6     Corey Booker Orders Release of Kavanaugh-Relat...
7     Nancy Pelosi Doesn’t Care What You Think of He...
8     Donald Trump lashed out at military for not ma...
9     Warren: Time to use 25th Amendment to remove T...
10                                 believe in something
11    Leaked Kavanaugh Documents From Time at White ...
12    Read the committee confidential document Cory ...
13    It Would Take Only a Single Senator to Check T...
14    Democrats force Senate to adjourn to protest K...
15    Joe Biden, Forceful at Times, Tears Into Trump...
16    Kavanaugh Confirmation Met With Cry Of Illegit...
17    Bob Woodward, Bane of Presidents, Confront

### Top word counts

In [28]:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
cv = CountVectorizer(max_features = 1000, stop_words = 'english')

In [29]:
X = comb['text']
y = comb[['target']]
cv.fit(X)
party_features = cv.transform(X);
#fit a basic count vectorization model to see common words in the model

In [30]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
party_df = pd.DataFrame(party_features.toarray(), columns=cv.get_feature_names_out())

In [31]:
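# "target" is also a word in the vocabulary; drop that word column so it
# does not collide with the label column joined in the next cell
party_df.drop(columns = "target", inplace = True)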
party_df.shape


(1661, 999)

In [32]:
party_df = party_df.join(y.target)
party_df.target.value_counts()

0    873
1    788
Name: target, dtype: int64

In [33]:
party_df.groupby('target').sum().T.sort_values(0, ascending=False)
#sort the words by their count in group 0 (democrats) and see how the top words
#are dispersed between the two groups

target,0,1
trump,656,131
president,169,15
like,131,20
party,95,8
house,89,20
white,88,24
mccain,87,44
democrats,86,29
republicans,85,10
republican,81,10


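The top words in the combined data frame are largely shared between Republicans and Democrats.  Note that the table is sorted by the Democratic counts (column 0), and the words shown appear far more often there (for example, "trump" appears 656 times versus 131).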
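Because the table above is sorted by the target-0 column only, one way to check how much the two groups' top words overlap is to rank each column separately. A minimal sketch using hypothetical toy counts rather than the notebook's data (`counts`, `top_n` and the words are made up for illustration):

```python
import pandas as pd

# Hypothetical per-document word counts (rows) with a class label.
counts = pd.DataFrame({
    "trump":  [3, 2, 1, 0],
    "senate": [1, 2, 0, 0],
    "border": [0, 0, 2, 3],
    "target": [0, 0, 1, 1],
})

# Words as rows, classes as columns (same shape as the table above).
totals = counts.groupby("target").sum().T

top_n = 2
top_0 = set(totals[0].nlargest(top_n).index)   # top words for class 0
top_1 = set(totals[1].nlargest(top_n).index)   # top words for class 1
shared = top_0 & top_1

print(sorted(top_0), sorted(top_1), sorted(shared))
```

With the real data, `top_n = 20` and `party_df` in place of `counts` would give the overlap between the two top-20 lists.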

In [34]:
X = comb_space['text']
y = comb_space[['target']]
cv.fit(X)
space_features = cv.transform(X);

In [35]:
space_features.shape

(1351, 1000)

In [36]:
space_df = pd.DataFrame(space_features.toarray(), columns=cv.get_feature_names_out())


In [37]:

space_df = space_df.join(y.target)
print(space_df.target.value_counts())


0    676
1    675
Name: target, dtype: int64


In [38]:
space_df.groupby('target').sum().T.sort_values(0, ascending=False)

target,0,1
trek,549,1
star,462,209
just,294,152
like,265,172
tng,217,0
episode,195,53
time,190,42
series,184,20
com,174,16
https,159,24


In [39]:
X = comb_entertainment['text']
y = comb_entertainment[['target']]
cv.fit(X)
entertainment_features = cv.transform(X);

In [40]:
entertainment_df = pd.DataFrame(entertainment_features.toarray(), columns=cv.get_feature_names_out())

In [41]:
entertainment_df = entertainment_df.join(y.target)

In [42]:
entertainment_df.groupby('target').sum().T.sort_values(0, ascending=False)

target,0,1
book,726,13
read,544,8
books,437,18
just,339,133
reading,331,3
com,319,55
like,307,143
www,276,44
ve,198,56
really,198,66


### Save data to csv for Analysis
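When writing these files, `index = False` keeps the row labels out of the CSV. Without it, reading the file back produces an extra `Unnamed: 0` column like the one visible in the `dems.head(1)` output earlier. A small sketch that round-trips a toy frame through in-memory buffers instead of files (the frame contents are invented for illustration):

```python
import io
import pandas as pd

data = pd.DataFrame({"text": ["first post", "second post"], "target": [0, 1]})

with_index = io.StringIO()
data.to_csv(with_index)                    # default index=True writes the row labels
with_index.seek(0)

without_index = io.StringIO()
data.to_csv(without_index, index=False)    # only the data columns are written
without_index.seek(0)

cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```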


In [47]:
features = ['text', 'target']
party_data = comb[features]

In [48]:

space_data = comb_space[features]

In [49]:
entertainment_data = comb_entertainment[features]

In [50]:
party_data.to_csv('./data/combined_data.csv', index = False)

In [51]:
space_data.to_csv('./data/space_data.csv', index = False)

In [52]:
entertainment_data.to_csv('./data/entertainment.csv', index = False)

In [56]:
#print out titles for analysis by people
print(comb_space.title[0:100])

0     State of the Subreddit: Flairs, Spoilers and C...
1     Chris Pine says he would love to do Star Trek ...
2     Thanks to Star Trek for making me feel like hu...
3                     Episodes to show an Ethics major?
4     Who are your overall 3 favorite and overall 3 ...
5     What do you think about J. Michael Straczynski...
6     Besides "JANEWAY MURDERED TUVIX" what in-unive...
7     What if Broken Bow, the Pilot of Enterprise, a...
8                              Trek Podcasts on Spotify
9     Recommendations on NX-01 model? What company i...
10    CBS reportedly negotiating exit for CEO Les Mo...
11    Is it just me or does the Federation really ne...
12    Would you prefer if future Star Trek movies fe...
13    Which captain would you feel the most comforta...
14                   Recommended watch order after TNG?
15                                       Trek in the UK
16    Just finished all 7 seasons of DS9. Sad that i...
17    I made this dumb little thing a while back