## Add the Target Variable
Now that the data from r/Democrats and r/Conservative have been acquired through the API, the target variable must be added. The study is not trying to predict a particular group, so the assignment of values to the target variable is arbitrary. In this case, r/Democrats will be assigned 0 and r/Conservative will be assigned 1.

In [73]:
import pandas as pd

In [74]:
dems = pd.read_csv('./data/democratic.csv')
reps = pd.read_csv('./data/republican.csv')


In [75]:
dems.head(1)

Unnamed: 0,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,...,thumbnail_height,thumbnail_width,title,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,,,False,backpackwayne,,,,[],,,...,,,THE TIME FOR UNITY IS NOW - A Progressive and ...,79,https://www.reddit.com/r/democrats/comments/95...,[],,False,all_ads,6


In [76]:
dems.shape


(876, 97)

In [77]:
reps.head(3)

Unnamed: 0,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,...,thumbnail_height,thumbnail_width,title,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,,,False,tehForce,,,,"[{'e': 'text', 't': 'Liberal Education != Purp...",,Liberal Education != Purple Hair,...,,,Michael Oakeshott - this week's sidebar tribute,41,https://www.reddit.com/r/Conservative/comments...,[],,False,all_ads,6
1,,,False,thatrightwinger,,,Far-Right,"[{'e': 'text', 't': ""Don't Tread on Me""}]",9b86186a-8b38-11e1-8f58-12313d2c1af1,Don't Tread on Me,...,105.0,140.0,Watch Live: Judge Brett Kavanaugh's confirmati...,91,https://www.youtube.com/watch?v=h8il9dAhA00,[],,False,all_ads,6
2,,,False,chabanais,,,Bold,"[{'e': 'text', 't': 'Stronger than derp'}]",,Stronger than derp,...,140.0,140.0,Ben Shapiro lays them away...,1426,https://i.redd.it/i3xkb1sgb8k11.jpg,[],,False,all_ads,6


In [78]:
reps.shape

(861, 97)

Both data sets have 97 variables. r/Democrats has 876 entries and r/Conservative has 861 entries.
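
Matching column counts do not by themselves guarantee matching column names, so it can be worth comparing the two sets of names before merging. The following is a minimal sketch of that check. It uses small stand-in frames (the hypothetical `dems_demo` and `reps_demo`) rather than the real CSV files.

```python
import pandas as pd

# Stand-in frames for illustration only; the real data come from the CSV files.
dems_demo = pd.DataFrame({"title": ["post a", "post b"], "ups": [79, 12]})
reps_demo = pd.DataFrame({"title": ["post c"], "ups": [41]})

# Compare column names in both directions, since equal counts can hide mismatches.
only_in_dems = set(dems_demo.columns) - set(reps_demo.columns)
only_in_reps = set(reps_demo.columns) - set(dems_demo.columns)

same_columns = not only_in_dems and not only_in_reps
print(same_columns)
```

If either set is non-empty, `pd.concat` would still run, but it would fill the unmatched columns with missing values.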

### Adding target value

In [79]:
dems['target'] = 0
dems['target'].head()

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64

In [80]:
reps['target'] = 1
reps['target'].head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

### Merging the dataframes

Each DataFrame should now have 98 features, and they can be merged together for analysis.

In [81]:
frames = [dems, reps]
comb = pd.concat(frames, axis = 0)

In [82]:
comb.shape

(1737, 98)

It looks as if the DataFrames combined correctly. The result still has 98 features and has 1737 cases, which is 861 + 876.
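
The row-count check can also be written as an assertion. The sketch below uses small stand-in frames (the hypothetical `dems_demo` and `reps_demo`). It also shows that `pd.concat` keeps each frame's original index, so index labels repeat in the combined frame unless `ignore_index=True` is passed. Whether to reset the index is a design choice that this notebook does not make; it is shown here only as an option.

```python
import pandas as pd

# Stand-in frames for illustration only.
dems_demo = pd.DataFrame({"title": ["a", "b", "c"], "target": 0})
reps_demo = pd.DataFrame({"title": ["d", "e"], "target": 1})

# Stack the rows; the number of rows should be the sum of the two inputs.
comb_demo = pd.concat([dems_demo, reps_demo], axis=0)
assert comb_demo.shape[0] == len(dems_demo) + len(reps_demo)

# Each frame keeps its own 0-based index, so labels repeat after stacking.
has_dup = comb_demo.index.has_duplicates

# Optional: ask pandas to build a fresh 0..n-1 index instead.
comb_reset = pd.concat([dems_demo, reps_demo], axis=0, ignore_index=True)
```

Repeated index labels do not matter when the frame is written out with `index = False`, but they can give surprising results with label-based selection such as `.loc`.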

Next, the excess variables will be stripped and the DataFrame will be saved to a CSV file to be read into an analysis notebook. The vectorization will be done with the analysis, since the parameters of the vectorization will be altered to maximize both the predictive and the explanatory power of the model.

In [83]:
comb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1737 entries, 0 to 860
Data columns (total 98 columns):
approved_at_utc                  0 non-null float64
approved_by                      0 non-null float64
archived                         1737 non-null bool
author                           1737 non-null object
author_cakeday                   2 non-null object
author_flair_background_color    0 non-null float64
author_flair_css_class           261 non-null object
author_flair_richtext            1735 non-null object
author_flair_template_id         175 non-null object
author_flair_text                552 non-null object
author_flair_text_color          554 non-null object
author_flair_type                1735 non-null object
author_fullname                  1735 non-null object
banned_at_utc                    0 non-null float64
banned_by                        0 non-null float64
can_gild                         1737 non-null bool
can_mod_post                     1737 non-null bool

In [62]:
# Drop columns where all of the data are missing; this eliminates 18 columns from the DataFrame
comb = comb.dropna(axis = 1, how = "all")
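
The number of columns that `dropna(axis = 1, how = "all")` will remove can be listed beforehand. The sketch below uses a small stand-in frame (the hypothetical `demo`) whose column names are borrowed from the output above. A column with even one non-missing value, such as `author_cakeday`, is kept.

```python
import numpy as np
import pandas as pd

# Stand-in frame for illustration only.
demo = pd.DataFrame({
    "approved_by": [np.nan, np.nan, np.nan],     # entirely missing
    "author": ["x", "y", "z"],                   # fully populated
    "author_cakeday": [np.nan, "True", np.nan],  # mostly missing, but not all
})

# Boolean mask over columns: True where every value is missing.
mask = demo.isna().all()
all_null = mask[mask].index.tolist()

# how="all" drops only the columns flagged by the mask.
trimmed = demo.dropna(axis=1, how="all")
print(all_null, list(trimmed.columns))
```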


In [63]:
comb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1737 entries, 0 to 860
Data columns (total 80 columns):
archived                    1737 non-null bool
author                      1737 non-null object
author_cakeday              2 non-null object
author_flair_css_class      261 non-null object
author_flair_richtext       1735 non-null object
author_flair_template_id    175 non-null object
author_flair_text           552 non-null object
author_flair_text_color     554 non-null object
author_flair_type           1735 non-null object
author_fullname             1735 non-null object
can_gild                    1737 non-null bool
can_mod_post                1737 non-null bool
clicked                     1737 non-null bool
contest_mode                1737 non-null bool
created                     1737 non-null float64
created_utc                 1737 non-null float64
crosspost_parent            97 non-null object
crosspost_parent_list       97 non-null object
distinguished               5 n

### Get rid of boolean and integer variables

The goal of the analysis is to explore the differences in language between the Democratic and Republican parties. The boolean and integer variables are therefore not necessary. They might help predict which subreddit a post came from, but they would not provide any insight into how the language used by each party differs, which is the focus of this study.

In [67]:
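# Keep only object (text) columns; this also drops the float columns and the integer `target` column
comb = comb.select_dtypes(include = "object")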

In [69]:
comb.shape


(1737, 42)
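
Because `target` is stored as `int64`, `select_dtypes(include = "object")` removes it together with the other numeric columns, yet the later selection of `title` and `target` needs it to be present. One possible way to keep it, sketched on a small stand-in frame (the hypothetical `demo`), is to collect the object column names first and then append `target` explicitly. This is an illustration, not the notebook's own method.

```python
import pandas as pd

# Stand-in frame for illustration only; dtype=object mirrors the text columns above.
demo = pd.DataFrame({
    "title": pd.Series(["post a", "post b"], dtype=object),
    "archived": [False, False],
    "ups": [79, 41],
    "target": [0, 1],
})

# Names of the text columns, plus the label column that must survive.
text_cols = demo.select_dtypes(include="object").columns.tolist()
keep = text_cols + ["target"]

trimmed = demo[keep]
print(list(trimmed.columns))
```

With this approach the trimmed frame has one more column than the object-only selection, namely the label.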

In [71]:
comb.info()



<class 'pandas.core.frame.DataFrame'>
Int64Index: 1737 entries, 0 to 860
Data columns (total 42 columns):
author                      1737 non-null object
author_cakeday              2 non-null object
author_flair_css_class      261 non-null object
author_flair_richtext       1735 non-null object
author_flair_template_id    175 non-null object
author_flair_text           552 non-null object
author_flair_text_color     554 non-null object
author_flair_type           1735 non-null object
author_fullname             1735 non-null object
crosspost_parent            97 non-null object
crosspost_parent_list       97 non-null object
distinguished               5 non-null object
domain                      1737 non-null object
edited                      1737 non-null object
id                          1737 non-null object
link_flair_css_class        212 non-null object
link_flair_richtext         1737 non-null object
link_flair_template_id      17 non-null object
link_flair_text             2

### Search for Features to be included in analysis

In [64]:
comb.title.head(4)

0    THE TIME FOR UNITY IS NOW - A Progressive and ...
1                                            Shameful.
2    Kavanaugh Hearing Erupts in Chaos as Dems Dema...
3    Democrats are driving home that Republicans ke...
Name: title, dtype: object

The title contains information that will help identify the issues and words that distinguish Democratic posts from Conservative posts.

In [85]:
features = ['title', 'target']
test_data = comb[features]

In [87]:
test_data.to_csv('./data/combined_data.csv', index = False)
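
Passing `index = False` matters when the file is read back: without it, the saved index returns as an extra `Unnamed: 0` column like the one visible in the `head()` output of the source files above. The sketch below shows the round trip on a stand-in frame (the hypothetical `demo`). It writes to an in-memory buffer instead of `./data/combined_data.csv` so that it runs on its own.

```python
import io

import pandas as pd

# Stand-in frame for illustration only.
demo = pd.DataFrame({"title": ["post a", "post b"], "target": [0, 1]})

# Write without the index, then read the same text back.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Only the two intended columns come back; no "Unnamed: 0" column appears.
print(list(reloaded.columns))
```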