<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Intro" data-toc-modified-id="Intro-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Intro</a></span></li><li><span><a href="#Importing-libraries" data-toc-modified-id="Importing-libraries-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Importing libraries</a></span></li><li><span><a href="#Uploading-csv-to-MongoDB" data-toc-modified-id="Uploading-csv-to-MongoDB-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Uploading csv to MongoDB</a></span></li></ul></div>

## Intro

This Jupyter notebook is the first part of the api-sentiment-project. We downloaded a dataset from Kaggle containing lines from many episodes of the TV show South Park. Here we clean the dataset and upload the data to MongoDB.

## Importing libraries

In [1]:
import pandas as pd

In [2]:
%config Completer.use_jedi = False

## Uploading csv to MongoDB

In [3]:
# connecting with MongoDB (MongoClient() with no arguments uses localhost:27017)

from pymongo import MongoClient
client = MongoClient()

In [4]:
# getting handles to the database and its three collections
# (MongoDB creates them lazily, on the first insert)

character_col = client.api_sentiment_project.characters
message_col = client.api_sentiment_project.messages
episodes_col = client.api_sentiment_project.episodes

In [5]:
# load the csv file into a pandas DF

south_park_lines = pd.read_csv('../data/sp_all-seasons.csv', header = 0)

In [6]:
south_park_lines.head()

Unnamed: 0,Season,Episode,Character,Line
0,10,1,Stan,"You guys, you guys! Chef is going away. \n"
1,10,1,Kyle,Going away? For how long?\n
2,10,1,Stan,Forever.\n
3,10,1,Chef,I'm sorry boys.\n
4,10,1,Stan,"Chef said he's been bored, so he joining a gro..."


In [7]:
south_park_lines.shape

(70896, 4)

In [8]:
south_park_lines.columns = south_park_lines.columns.str.lower()

In [9]:
sp_cols = list(south_park_lines.columns)

for col in sp_cols:
    print(f"The data type of column {col} is {type(south_park_lines[col][0])}")

The data type of column season is <class 'str'>
The data type of column episode is <class 'str'>
The data type of column character is <class 'str'>
The data type of column line is <class 'str'>
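
Checking `type()` on the first element only looks at one value per column. As a complementary sketch, `dtypes` reports pandas' own view of every column, and mapping over a column counts the Python types it actually holds. The tiny dataframe below is made up for illustration; it is not the South Park data.

```python
import pandas as pd

# Made-up sample (not the real dataset): 'season' holds a stray header value
sample = pd.DataFrame({"season": ["10", "Season", "11"], "episode": [1, 2, 3]})

# Column-level dtypes as strings; a text column is never reported as int64
dtypes = sample.dtypes.astype(str).to_dict()

# Count the Python type names found in the 'season' column
type_names = sample["season"].map(lambda v: type(v).__name__).value_counts().to_dict()

print(dtypes)
print(type_names)
```

If a column that should be numeric shows up as text here, that is a hint to inspect its unique values.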


Season and episode should be numeric, so let's check why these first two columns were read as strings.

In [10]:
south_park_lines.season.unique()

array(['10', 'Season', '11', '12', '13', '14', '15', '16', '17', '18',
       '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype=object)

The value 'Season' shows up among the seasons, so the CSV contains repeated header rows mixed in with the data. That is why pandas read these columns as strings. Let's drop those rows.

In [11]:
# .copy() avoids a SettingWithCopyWarning in the column assignments further down
south_park_lines = south_park_lines[south_park_lines.season != 'Season'].copy()

In [12]:
south_park_lines.shape

(70879, 4)
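
The shape went from 70896 to 70879 rows, so 17 stray header rows were dropped. The same filter can be sketched on a miniature CSV; the file content below is invented for the example and only mimics the structure of the real one.

```python
import io
import pandas as pd

# Hypothetical miniature CSV in which the header row repeats mid-file
raw = io.StringIO(
    "Season,Episode,Character,Line\n"
    "10,1,Stan,Forever.\n"
    "Season,Episode,Character,Line\n"
    "11,1,Kyle,Dude!\n"
)
df = pd.read_csv(raw)
df.columns = df.columns.str.lower()

# Rows whose 'season' value equals the header name are stray header rows
is_header = df["season"] == "Season"
n_dropped = int(is_header.sum())

# Keep only real data rows; .copy() gives an independent dataframe
clean = df[~is_header].copy()

print(n_dropped)
print(clean)
```

Counting the matches before filtering is a cheap way to confirm that only header rows are being removed.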

Let's change the data type of the first two columns to int

In [13]:
south_park_lines['season'] = pd.to_numeric(south_park_lines['season'])

In [14]:
type(south_park_lines.season[1])

numpy.int64

In [15]:
south_park_lines['episode'] = pd.to_numeric(south_park_lines['episode'])
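
`pd.to_numeric` raises an error if any value cannot be parsed, which is why the stray header rows had to go first. As a hedged alternative sketch, `errors="coerce"` can be used to find the offending values instead of failing. The series below is made up for illustration.

```python
import pandas as pd

# Made-up values: two numeric strings and one stray header value
raw = pd.Series(["10", "Season", "11"])

# errors="coerce" turns unparseable entries into missing values instead of raising
coerced = pd.to_numeric(raw, errors="coerce")

# The entries that failed to parse
unparseable = raw[coerced.isna()].tolist()

# Once the stray values are excluded, the conversion yields integers
clean = pd.to_numeric(raw[coerced.notna()])

print(unparseable)
print(clean.tolist())
```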

Finally, let's remove the '\n' at the end of each line

In [16]:
south_park_lines.line = south_park_lines.line.replace('\n','', regex=True)
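
Note that `replace('\n', '', regex=True)` removes every newline in a string, not only the one at the end. The sketch below, on made-up lines, compares it with `str.rstrip('\n')`, which only touches trailing newlines. Which one is preferable depends on whether the real data has newlines inside a line, which we have not checked here.

```python
import pandas as pd

# Made-up lines; the last one has a newline in the middle as well
lines = pd.Series(["Going away? For how long?\n", "Forever.\n", "One\nTwo\n"])

# Same call as in the notebook: removes all newline characters
all_removed = lines.replace("\n", "", regex=True)

# Alternative: strip only the trailing newline characters
trailing_only = lines.str.rstrip("\n")

print(all_removed.tolist())
print(trailing_only.tolist())
```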

Now we can create the dataframes and load each one into its corresponding MongoDB collection. The messages collection will hold the south_park_lines dataframe, but we need two more: characters and episodes.

Let's start with the episodes collection

In [17]:
# size() counts the rows per (season, episode); reset_index already returns a dataframe
sp_episodes = south_park_lines.groupby(['season', 'episode']).size().reset_index(name='number_lines')

In [18]:
sp_episodes.head()

Unnamed: 0,season,episode,number_lines
0,1,1,391
1,1,2,297
2,1,3,286
3,1,4,364
4,1,5,314


In [19]:
sp_episodes.sort_values(['season', 'episode'])

Unnamed: 0,season,episode,number_lines
0,1,1,391
1,1,2,297
2,1,3,286
3,1,4,364
4,1,5,314
...,...,...,...
252,18,6,240
253,18,7,305
254,18,8,250
255,18,9,250
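
To make the `groupby(...).size().reset_index(name=...)` step easier to follow, here is a minimal sketch on a made-up dataframe with four lines spread over three episodes.

```python
import pandas as pd

# Made-up lines: two in S1E1, one in S1E2, one in S2E1
lines = pd.DataFrame({
    "season":    [1, 1, 1, 2],
    "episode":   [1, 1, 2, 1],
    "character": ["Stan", "Kyle", "Stan", "Chef"],
    "line":      ["a", "b", "c", "d"],
})

# size() counts rows per group; reset_index turns the group keys back into columns
episodes = (
    lines.groupby(["season", "episode"])
         .size()
         .reset_index(name="number_lines")
)

print(episodes)
```

`groupby` sorts by the group keys by default, so the result is already ordered by season and episode.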


Now we can create a dataframe for characters

In [20]:
sp_characters = pd.DataFrame(data = south_park_lines.character.unique(), columns = ['name'])

In [21]:
sp_characters

Unnamed: 0,name
0,Stan
1,Kyle
2,Chef
3,Mrs. Garrison
4,Cartman
...,...
3944,Male Voice
3945,AA Speaker
3946,Father Barnes
3947,Cardinal Mallory
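
`unique()` keeps names in order of first appearance and compares them exactly, so variants that differ only in whitespace count as separate characters. The sketch below uses a made-up speaker column to show the effect and one possible fix; whether the real data contains such variants is an assumption we have not verified.

```python
import pandas as pd

# Made-up speaker column; note the trailing space in the third entry
speakers = pd.Series(["Stan", "Kyle", "Stan ", "Chef", "Kyle"])

# unique() keeps first-appearance order and treats "Stan " as a separate name
raw_names = list(speakers.unique())

# Stripping whitespace first merges such variants
clean_names = list(speakers.str.strip().unique())

characters = pd.DataFrame({"name": clean_names})

print(raw_names)
print(characters)
```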


We can finally load these into MongoDB

In [22]:
character_col.insert_many(sp_characters.to_dict('records'))

<pymongo.results.InsertManyResult at 0x7f7f6fd9fc40>

In [23]:
episodes_col.insert_many(sp_episodes.to_dict('records'))

<pymongo.results.InsertManyResult at 0x7f7f717f0e00>

In [24]:
message_col.insert_many(south_park_lines.to_dict('records'))

<pymongo.results.InsertManyResult at 0x7f7f70f10c40>
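
One thing to keep in mind: `insert_many` appends documents, so running these cells a second time would store everything twice. A minimal sketch of a re-runnable load is shown below. `reload_collection` is a hypothetical helper, not part of the project, and `InMemoryCollection` is a stand-in that only exists so the example runs without a MongoDB server. With pymongo, a real collection such as `episodes_col` could be passed instead, since it provides `delete_many` and `insert_many`.

```python
import pandas as pd


class InMemoryCollection:
    """Stand-in with the two pymongo Collection methods used below (illustration only)."""

    def __init__(self):
        self.docs = []

    def delete_many(self, query):
        # This stand-in only supports the empty filter {}, meaning "all documents"
        self.docs.clear()

    def insert_many(self, documents):
        self.docs.extend(documents)


def reload_collection(collection, df):
    """Hypothetical helper: replace the collection's contents with the rows of df."""
    records = df.to_dict("records")
    collection.delete_many({})           # clear documents left over from earlier runs
    if records:                          # pymongo's insert_many rejects an empty list
        collection.insert_many(records)
    return len(records)


# Made-up episodes dataframe
episodes = pd.DataFrame({"season": [1, 1], "episode": [1, 2], "number_lines": [391, 297]})

col = InMemoryCollection()
reload_collection(col, episodes)
n_inserted = reload_collection(col, episodes)  # second run does not duplicate

print(n_inserted, len(col.docs))
```

Clearing the collection first is the simplest option for a one-off load like this one; it would not be appropriate if other data lived in the same collection.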