# Combining all data from r/Batman and r/Joker

In this notebook, I combine the data I collected from each subreddit. I work only with post titles, because many of the selftext fields were either empty or contained non-text content. This saves time on cleaning while still leaving a sizeable corpus for training my models.

### Import Pandas

In [2]:
import pandas as pd

### Read in .csv files

In [3]:
batman_titles = pd.read_csv('batmantitles.csv')
batman_titles1 = pd.read_csv('batmantitles1.csv')
joker_titles = pd.read_csv('jokertitles.csv')
joker_titles1 = pd.read_csv('jokertitles1.csv')

In [7]:
batman_titles.shape

(1089, 3)

In [8]:
batman_titles1.shape

(991, 3)

In [11]:
joker_titles.shape

(1075, 3)

In [12]:
joker_titles1.shape

(1075, 3)

All four files loaded with three columns and roughly 1,000 rows each, so the imports look correct. I'll create a dataframe for each subreddit before combining the two into a master dataframe, which I'll use for classification.
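The four shape checks above can be gathered into a single pass. This is a minimal sketch that uses small toy frames in place of the real .csv files; the names `frames`, `EXPECTED_COLUMNS`, and `summarize` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the scraped files (hypothetical data, same column layout).
frames = {
    "batman_titles": pd.DataFrame(
        {"Unnamed: 0": [0, 1], "posts": ["title a", "title b"], "label": [1, 1]}
    ),
    "joker_titles": pd.DataFrame(
        {"Unnamed: 0": [0, 1], "posts": ["title c", "title d"], "label": [1, 1]}
    ),
}

EXPECTED_COLUMNS = ["Unnamed: 0", "posts", "label"]


def summarize(frames):
    """Return one row per frame: row count, column check, missing titles."""
    rows = []
    for name, df in frames.items():
        rows.append(
            {
                "name": name,
                "n_rows": len(df),
                "columns_ok": list(df.columns) == EXPECTED_COLUMNS,
                "missing_posts": int(df["posts"].isna().sum()),
            }
        )
    return pd.DataFrame(rows)


summary = summarize(frames)
print(summary)
```

With the real data, `frames` would hold the four dataframes returned by `pd.read_csv`.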

### Batman Dataframe

In [18]:
batman = pd.concat([batman_titles, batman_titles1], axis=0)

In [19]:
batman.head()

,Unnamed: 0,posts,label
0,0,Weekly Batman Comics (12/12/2018): The Batman ...,1
1,1,Weekly Batman Discussion Thread - Which crimin...,1
2,2,Wholesome,1
3,3,Hardy tells a story about Bale during TDKR´s f...,1
4,4,'Batman Villains' by Glen Orbik and Laurel Ble...,1


In [7]:
batman.shape

(2080, 3)

### Drop 'Unnamed: 0'
This column is an artifact of saving and re-reading the .csv files (most likely the row index that was written out with the data). It is not needed for this project, so I'll drop it.

In [20]:
batman.drop('Unnamed: 0', axis = 1, inplace=True)

In [21]:
batman.head()

,posts,label
0,Weekly Batman Comics (12/12/2018): The Batman ...,1
1,Weekly Batman Discussion Thread - Which crimin...,1
2,Wholesome,1
3,Hardy tells a story about Bale during TDKR´s f...,1
4,'Batman Villains' by Glen Orbik and Laurel Ble...,1
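A column named 'Unnamed: 0' typically appears when a dataframe is written with `to_csv` using its default `index=True` and then read back with `read_csv`. The sketch below reproduces that round trip in memory and shows two ways to avoid it. I am assuming this is how the column arose here; the toy dataframe is illustrative only.

```python
import io

import pandas as pd

df = pd.DataFrame({"posts": ["Wholesome", "da fuq?"], "label": [1, 1]})

# By default, to_csv writes the row index as an unnamed first column.
with_index = df.to_csv()
round_trip = pd.read_csv(io.StringIO(with_index))
print(list(round_trip.columns))  # ['Unnamed: 0', 'posts', 'label']

# Option 1: do not write the index in the first place.
without_index = df.to_csv(index=False)
clean_a = pd.read_csv(io.StringIO(without_index))

# Option 2: read the first column back in as the index.
clean_b = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(clean_a.columns))  # ['posts', 'label']
print(list(clean_b.columns))  # ['posts', 'label']
```

Either option removes the need for a separate `drop` step after loading.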


### Joker Dataframe

In [8]:
joker = pd.concat([joker_titles, joker_titles1], axis=0)

In [9]:
joker.head()

,Unnamed: 0,posts,label
0,0,Got my first tattoo! Credits to _alexbadea_ on...,1
1,1,The Joker as portrayed in The Dark Knight was ...,1
2,2,I did a Joker Sculpture and wanted to share it...,1
3,3,"This is a very atmospheric, eerie playlist to ...",1
4,4,da fuq?,1


In [11]:
joker.shape

(2150, 3)

As with the Batman dataframe above, I need to drop 'Unnamed: 0'.

In [22]:
joker.drop('Unnamed: 0', axis = 1, inplace=True)

### Assigning Binary labels

Since this is a binary classification problem, I need a positive and a negative class. I neglected this in the previous notebook, where both subreddits were given the label 1. Batman will keep the 1 label, so I'll reassign Joker as 0.

In [27]:
joker['label'].replace("1","0", inplace=True)

In [28]:
joker.head()

,posts,label
0,Got my first tattoo! Credits to _alexbadea_ on...,1
1,The Joker as portrayed in The Dark Knight was ...,1
2,I did a Joker Sculpture and wanted to share it...,1
3,"This is a very atmospheric, eerie playlist to ...",1
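4,da fuq?,1

The labels are unchanged: the column holds integers, so replacing the string "1" matches nothing. I'll try mapping the values instead.
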
In [33]:
joker['label'] = joker['label'].map({'1': '0'})

In [34]:
joker.head()

,posts,label
0,Got my first tattoo! Credits to _alexbadea_ on...,
1,The Joker as portrayed in The Dark Knight was ...,
2,I did a Joker Sculpture and wanted to share it...,
3,"This is a very atmospheric, eerie playlist to ...",
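4,da fuq?,

The string key in the mapping doesn't match the integer labels either, so every label is now missing (NaN). Since every row in this dataframe comes from r/Joker, I can simply fill the missing values with 0.
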
In [36]:
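# Fill with the integer 0 (not the string '0') so the type matches Batman's integer 1 labels
joker['label'] = joker['label'].fillna(0).astype(int)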

In [37]:
joker.head()

,posts,label
0,Got my first tattoo! Credits to _alexbadea_ on...,0
1,The Joker as portrayed in The Dark Knight was ...,0
2,I did a Joker Sculpture and wanted to share it...,0
3,"This is a very atmospheric, eerie playlist to ...",0
4,da fuq?,0
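The relabelling took a few attempts because the label column holds integers while the replacement values were written as strings. The sketch below shows the difference on a toy dataframe (`joker_demo` is a hypothetical stand-in) and ends with the most direct approach, which works here because every row belongs to the same class.

```python
import pandas as pd

joker_demo = pd.DataFrame({"posts": ["t1", "t2", "t3"], "label": [1, 1, 1]})

# The string "1" does not match the integer 1, so nothing is replaced.
unchanged = joker_demo["label"].replace("1", "0")
print(unchanged.tolist())  # [1, 1, 1]

# Mapping with a string key finds no matches, so every value becomes NaN.
mapped = joker_demo["label"].map({"1": "0"})
print(mapped.isna().all())  # True

# Integer keys match the stored values.
replaced = joker_demo["label"].replace(1, 0)
print(replaced.tolist())  # [0, 0, 0]

# Simplest of all: every row is r/Joker, so assign the label directly.
joker_demo["label"] = 0
print(joker_demo["label"].tolist())  # [0, 0, 0]
```

Assigning the result back to the column, rather than calling a method with `inplace=True` on `joker['label']`, also avoids the chained-assignment pitfalls in recent pandas versions.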


### Combining Joker and Batman dataframes

The dataframe I'll create below will be used as the basis for my modeling process. 

In [38]:
you_complete_me = pd.concat([batman, joker], axis=0)

In [40]:
you_complete_me.tail()

,posts,label
1070,First official image of Joaquin Phoenix in Joker,0
1071,A few weeks ago my friends and I did a few pho...,0
1072,My by theory about the Joker's possible mental...,0
1073,Top 10 Batman Arkham Villains,0
1074,Joker is always watching me,0


In [41]:
you_complete_me.shape

(4230, 2)
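The tail above ends at index 1074 even though the dataframe has 4,230 rows, because `pd.concat` keeps each input's original index by default. If a unique row index is wanted, `ignore_index=True` renumbers the rows. A minimal sketch on toy frames (the `_demo` names are hypothetical):

```python
import pandas as pd

batman_demo = pd.DataFrame({"posts": ["b1", "b2"], "label": [1, 1]})
joker_demo = pd.DataFrame({"posts": ["j1", "j2", "j3"], "label": [0, 0, 0]})

# Default behaviour: original index values are kept, so they repeat.
stacked = pd.concat([batman_demo, joker_demo], axis=0)
print(stacked.index.tolist())  # [0, 1, 0, 1, 2]

# ignore_index=True assigns a fresh 0..n-1 index to the combined frame.
renumbered = pd.concat([batman_demo, joker_demo], axis=0, ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```

A repeated index is harmless for many modeling steps, but label-based lookups such as `stacked.loc[0]` return more than one row.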

In [42]:
you_complete_me.to_csv('joker_batman.csv')

In [43]:
you_complete_me.isnull().sum()

posts    0
label    0
dtype: int64

In [47]:
you_complete_me['label'].value_counts()

0    2150
1    2080
Name: label, dtype: int64
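The two classes are close to balanced. As a quick worked check, the share of the larger class gives the accuracy of a classifier that always predicts that class, which is a useful floor for judging later models. This is a sketch using the counts printed above; the variable names are my own.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 2150, 1: 2080}

total = sum(counts.values())  # 4230 rows in the combined dataframe
majority_share = max(counts.values()) / total

print(total)                    # 4230
print(f"{majority_share:.3f}")  # 0.508
```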

At last, my data is stored in a dataframe and labelled for classification. In the next notebook, I'll subject this data to Natural Language Processing techniques in order to train a variety of classifier models and explore the results.