<h1><center> Data Cleaning </center></h1>

<img src=https://analyticsindiamag.com/wp-content/uploads/2018/01/data-cleaning-1280x720.png style="width: 600px;"/>

In [155]:
import pandas as pd
import re
import string
import pickle
from string import punctuation
from sklearn.feature_extraction.text import CountVectorizer

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

from nltk.stem.porter import PorterStemmer
# nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from autocorrect import spell

## Getting Data 

In [331]:
df = pd.read_pickle('videoInfo.pkl')
df.rename(columns=
          {"View Count": "ViewCount",
           "Comment Count": "CommentCount",
            "Uploader":"Person"
          }, inplace = True)
df.head()

Unnamed: 0,Person,VideoId,VideoTitle,Description,ViewCount,Likes,Dislikes,CommentCount,Comments
0,Emily Ann Shaheen,2mg3sFuiwRw,HOW TO BE AN ARAB GIRL!,Heyyy guys! Thank you so much for watching! Th...,80408,1971,147,640,Im Arab from iraq but my golden name is in eng...
1,Emily Ann Shaheen,Wlub9KOJBt4,ARAB GIRL STEREOTYPES!,"Thanks for watching babes! xx, Emily Ann Shahe...",27925,608,46,240,Being an Arab is great and proud also true we ...
2,nowthisisliving,D2iCOMoOkyI,LESBIAN INTERVIEWS EX BOYFRIEND,"please be kind in the comments, this boy is th...",2098024,51987,1644,2718,My ex broke up with me because she wanted to b...
3,nowthisisliving,_Uxw2X0hNGg,why we broke up,I know this is a tough video for everyone. We ...,3081184,74616,1766,9406,still high key want them to get back together ...
4,Madison Beer,-9BfaW69LSk,Madison Beer- Catch Me Cover,HEY YOUTUBE!!!!!!!!! LONG TIME NO VIDEO! so so...,920561,14418,2408,2747,Be strong That's my fav song of demi lovato I'...


Keeping only each person and the comments their videos received:

In [332]:
person_comments = df[['Person','Comments']]
person_comments.head()

Unnamed: 0,Person,Comments
0,Emily Ann Shaheen,Im Arab from iraq but my golden name is in eng...
1,Emily Ann Shaheen,Being an Arab is great and proud also true we ...
2,nowthisisliving,My ex broke up with me because she wanted to b...
3,nowthisisliving,still high key want them to get back together ...
4,Madison Beer,Be strong That's my fav song of demi lovato I'...


Some of the videos I chose were uploaded by one account but feature a different person as their subject. Since I want to analyze the comments aimed at the subject of each video, I replace the uploader's name with the subject's name.

In [333]:
#Replacing Person with actual subject in the video
person_comments = person_comments.replace({'Person' : {'Charlie Ayee': 'Steven Assanti', 
                               'AlienRadioStation' : 'Steven Assanti',
                               'rebecca' : 'Rebecca Black',
                               'nowthisisliving':'Shannon Beveridge',
                               'RoughChop':'Richard Gale'}})
person_comments['Person'].to_frame().T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
Person,Emily Ann Shaheen,Emily Ann Shaheen,Shannon Beveridge,Shannon Beveridge,Madison Beer,Madison Beer,Rebecca Black,Rebecca Black,Madeline & Eric,Madeline & Eric,Steven Assanti,Steven Assanti,Danielle Cohn,Danielle Cohn,Richard Gale,Richard Gale


**Combining subjects in one row**:

Each subject appears in several (video, comments) rows. I am interested in the kind of comments each subject receives, so I combine the comments from all of a subject's videos into a single row per subject.

In [335]:
person_comments_grouped=person_comments.groupby(['Person'])['Comments'].apply(','.join).reset_index()
person_comments_grouped

Unnamed: 0,Person,Comments
0,Danielle Cohn,Disgusting So u 13 and you had a first time Th...
1,Emily Ann Shaheen,Im Arab from iraq but my golden name is in eng...
2,Madeline & Eric,Your grocery carts are sooo small. Ours look g...
3,Madison Beer,Be strong That's my fav song of demi lovato I'...
4,Rebecca Black,My old girlfriend also had a book titled Knot ...
5,Richard Gale,Hes a bitch Richard your a n idiot mate He bul...
6,Shannon Beveridge,My ex broke up with me because she wanted to b...
7,Steven Assanti,"Steven is a Sagittarius, we be great is my guy..."
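To make the grouping step concrete, here is a minimal plain-Python sketch of what `groupby(...).apply(','.join)` does. The rows and the names `rows`, `grouped` and `combined` are hypothetical toy values, not the real data, and the sketch uses a `defaultdict` instead of pandas so that it runs on its own.

```python
from collections import defaultdict

# Hypothetical toy rows: (person, comments collected for one video)
rows = [
    ("Person A", "first video comments"),
    ("Person B", "only video comments"),
    ("Person A", "second video comments"),
]

# Collect every comment blob under its person
grouped = defaultdict(list)
for person, comments in rows:
    grouped[person].append(comments)

# Join each person's blobs with a comma, mirroring ','.join in the cell above
combined = {person: ",".join(blobs) for person, blobs in sorted(grouped.items())}

for person, text in combined.items():
    print(person, "->", text)
```

Because the separator is a bare comma, the last word of one video's comments and the first word of the next are separated only by that comma. The punctuation step in the cleaning function later replaces it with a space, so the two words do not end up fused.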


In [336]:
#Pickling grouped unclean data for later use
person_comments_grouped.to_pickle("person_comments_grouped.pkl")

In [337]:
# Looking at a sample text:
person_comments_grouped['Comments'][1][:1000]

'Im Arab from iraq but my golden name is in english 😂 You remind me so much of rclbeauty101.. I hope to marry an Arab woman one day. I never noticed how attractive I find them until I became an adult I love it! I sooooo have a necklace with my name in Arabic and I\'m very proud of it! Thank you for your videos! Much love to you from another Arab-American "sister"! I thought I\'d get some tips to look like an arabian Girl 😂😂😂😂 I\'m literally the hairiest fuckin girl in my class(ALL ARABS) and I can\'t do about it.... except mustacge and brows I’m not Arab I’m Somali \nMy name is hard to say\nI have a gold necklace \nIt says my name on it  \nLike if you got a gold \nNecklace Wanna be a arab girl? Hate Men! Abdullah Ahmed more like \nWanna be a brainwashed SJW? hate men Where did you get your necklace?!💖 Is middle east a new continent \n Since when the middle east became a new continent its not a continent... Middle East is part of Asia. I\'m arab(Algeriaaaaaa🇩🇿🇩🇿🇩🇿🇩🇿🇩🇿) and my face is co

## Cleaning Data

**Common data cleaning steps on all text:**

* Make text lowercase
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text
* Tokenize text
* Remove stop words
* Remove emojis and icons

**More data cleaning steps after tokenization:**

* Stemming / lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
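Bi-grams and tri-grams are not built in this notebook, so the following is only a sketch of what that step could look like. The helper `ngrams` is a hypothetical name, and the whitespace split stands in for a real tokenizer.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace tokenization, for illustration only
tokens = "my name is hard to say".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A text with fewer than `n` tokens simply yields an empty list.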

In [338]:
# Text Cleaning

def textCleaning(text):
    text = re.sub(r'[%s]' % re.escape(punctuation), ' ', text)  # remove ASCII punctuation
    text = re.sub(r'\w*\d\w*', ' ', text)  # remove any token that contains a digit
    text = re.sub(r'[‘’“”…]', ' ', text)  # remove curly quotes and ellipsis characters
    text = text.lower()  # make lowercase
    text = re.sub(r'\n', ' ', text)  # replace newlines with spaces
    return text

text_cleaner = lambda x: textCleaning(x)
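As a quick check of what these substitutions do, the sketch below repeats the same steps under a hypothetical name, `clean_text`, so that it runs on its own, and applies them to a made-up sample string. It shows two side effects that are easy to miss: a token containing a digit is removed entirely, and contractions are split at the apostrophe.

```python
import re
from string import punctuation

def clean_text(text):
    # Same steps as textCleaning above, repeated so this snippet is self-contained
    text = re.sub(r'[%s]' % re.escape(punctuation), ' ', text)  # punctuation -> space
    text = re.sub(r'\w*\d\w*', ' ', text)  # drop whole tokens that contain a digit
    text = re.sub(r'[‘’“”…]', ' ', text)  # curly quotes and ellipsis -> space
    text = text.lower()
    text = re.sub(r'\n', ' ', text)  # newline -> space
    return text

# Made-up sample in the style of the comments
sample = "You remind me of rclbeauty101.. I'm 13\nArab-American"
cleaned = clean_text(sample)
print(cleaned.split())
```

The username `rclbeauty101` and the age `13` both disappear, and `I'm` becomes the two tokens `i` and `m`.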

In [339]:
#Removing Emojis

def removing_emojis(text):
    # Emoji pattern. Note: the two widest ranges below (U+24C2-U+1F251 and
    # U+10000-U+10FFFF) also match non-emoji characters such as CJK text.
    emoji_pattern = re.compile("["
                    "\U0001F600-\U0001F64F"  # emoticons
                    "\U0001F300-\U0001F5FF"  # symbols & pictographs
                    "\U0001F680-\U0001F6FF"  # transport & map symbols
                    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
                    "\U00002702-\U000027B0"
                    "\U000024C2-\U0001F251"
                    "\U0001f926-\U0001f937"
                    "\U00010000-\U0010ffff"
                    "\u200d"
                    "\u2640-\u2642"
                    "\u2600-\u2B55"
                    "\u23cf"
                    "\u23e9"
                    "\u231a"
                    "\u3030"
                    "\ufe0f"
                    "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r' ', text)

emojiRemoval = lambda x: removing_emojis(x)
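The emoji pattern contains two very wide ranges, so it removes more than emojis. The sketch below uses only those two ranges (the name `broad` and the sample string are assumptions for illustration) to show which scripts survive.

```python
import re

# Only the two widest ranges from the pattern above, for illustration
broad = re.compile("[\U000024C2-\U0001F251\U00010000-\U0010FFFF]+")

# Made-up sample: an emoji, two CJK characters, and two Arabic letters
sample = "nice 😂 中文 يا"
result = broad.sub(" ", sample)

print(result.split())
```

The emoji and the CJK characters are removed because they fall inside the ranges, while the Arabic letters (U+0600 block) lie below them and are kept. This matches the Arabic terms that remain as columns in the document-term matrix later on.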

In [340]:
#Lemmatization
lemma = WordNetLemmatizer()
def lemmatization(text):
    # Lemmatize each token and rebuild the text
    words = word_tokenize(text)
    return ' '.join(lemma.lemmatize(w) for w in words)

lemmatizor = lambda x: lemmatization(x)


#Stemming
porter = PorterStemmer()
def stemming(text):
    # Stem each token and rebuild the text
    words = word_tokenize(text)
    return ' '.join(porter.stem(w) for w in words)

stemmor = lambda x: stemming(x)


#Spell Checking
def spellCheck(text):
    # `spell` corrects one word at a time, so correct each token separately
    words = word_tokenize(text)
    return ' '.join(spell(w) for w in words)

spell_checker = lambda x: spellCheck(x)

## Organizing Data 


### Corpus 

A corpus is a collection of texts. Here we build a collection of the cleaned comments.

Let us apply the text-cleaning functions we created above.

For now, I am skipping lemmatization, stemming and spell checking, because they tend to correct or otherwise change the spelling of words. In this project we are also looking for abusive words, and we don't want those words to be altered.

In [341]:
comment_clean= person_comments_grouped['Comments'].apply(text_cleaner).apply(emojiRemoval)
               #.apply(lemmatizor).apply(stemmor).apply(spell_checker)

person_comments_cleaned= person_comments_grouped.drop(['Comments'], axis=1)
person_comments_cleaned.insert(1, 'Cleaned_Comments', comment_clean)

person_comments_cleaned

Unnamed: 0,Person,Cleaned_Comments
0,Danielle Cohn,disgusting so u and you had a first time thi...
1,Emily Ann Shaheen,im arab from iraq but my golden name is in eng...
2,Madeline & Eric,your grocery carts are sooo small ours look g...
3,Madison Beer,be strong that s my fav song of demi lovato i ...
4,Rebecca Black,my old girlfriend also had a book titled knot ...
5,Richard Gale,hes a bitch richard your a n idiot mate he bul...
6,Shannon Beveridge,my ex broke up with me because she wanted to b...
7,Steven Assanti,steven is a sagittarius we be great is my guy...


In [342]:
# Looking at a sample text:
person_comments_cleaned['Cleaned_Comments'][1][:1000]

'im arab from iraq but my golden name is in english   you remind me so much of     i hope to marry an arab woman one day  i never noticed how attractive i find them until i became an adult i love it  i sooooo have a necklace with my name in arabic and i m very proud of it  thank you for your videos  much love to you from another arab american  sister   i thought i d get some tips to look like an arabian girl   i m literally the hairiest fuckin girl in my class all arabs  and i can t do about it     except mustacge and brows i m not arab i m somali  my name is hard to say i have a gold necklace  it says my name on it   like if you got a gold  necklace wanna be a arab girl  hate men  abdullah ahmed more like  wanna be a brainwashed sjw  hate men where did you get your necklace    is middle east a new continent   since when the middle east became a new continent its not a continent    middle east is part of asia  i m arab algeriaaaaaa   and my face is covered with moles and i literally sh

In [343]:
#Pickling cleaned data for later use
person_comments_cleaned.to_pickle("person_comments_cleaned.pkl")

In [344]:
#Converting to Key Value pair
personComments_dict = dict(zip(person_comments_cleaned.Person, person_comments_cleaned.Cleaned_Comments))

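# We are going to change this to key: person, value: [comments string] format.
# Each value is already a single cleaned string per person, so combine_text
# leaves it unchanged; it would only matter if the values were lists of strings.
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ''.join(list_of_text)
    return combined_text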

person_comments_cleaned_kvp = {key: [combine_text(value)] for (key, value) in personComments_dict.items()}

personComments_corpus = pd.DataFrame.from_dict(person_comments_cleaned_kvp).transpose()
personComments_corpus.columns = ['comments']
personComments_corpus = personComments_corpus.sort_index()
personComments_corpus

Unnamed: 0,comments
Danielle Cohn,disgusting so u and you had a first time thi...
Emily Ann Shaheen,im arab from iraq but my golden name is in eng...
Madeline & Eric,your grocery carts are sooo small ours look g...
Madison Beer,be strong that s my fav song of demi lovato i ...
Rebecca Black,my old girlfriend also had a book titled knot ...
Richard Gale,hes a bitch richard your a n idiot mate he bul...
Shannon Beveridge,my ex broke up with me because she wanted to b...
Steven Assanti,steven is a sagittarius we be great is my guy...


In [345]:
#Pickling corpus for later use
personComments_corpus.to_pickle('personComments_corpus.pkl')

### Document Term Matrix (DTM)

A document-term matrix holds word counts in matrix form: one row per document (here, one per person) and one column per term in the vocabulary.
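The idea can be sketched in plain Python with `collections.Counter`. The corpus and the tiny stop-word set below are hypothetical toy values, and this is an illustration of the concept rather than of how `CountVectorizer` works internally.

```python
from collections import Counter

# Hypothetical toy corpus: one cleaned text per person
corpus = {
    "Person A": "love this song love it",
    "Person B": "this video is great",
}
stop_words = {"this", "is", "it"}  # tiny illustrative stop-word list

# Count the non-stop-word tokens of each document
counts = {
    name: Counter(w for w in text.split() if w not in stop_words)
    for name, text in corpus.items()
}

# The vocabulary is the sorted set of all terms seen in any document
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one column per vocabulary term
dtm = {name: [c[term] for term in vocabulary] for name, c in counts.items()}

print(vocabulary)
for name, row in dtm.items():
    print(name, row)
```

One difference worth knowing: with its default token pattern, `CountVectorizer` also drops single-character tokens, so leftovers from the cleaning step such as `i`, `m` or `u` do not become columns.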

In [346]:
# Creating document-term matrix using CountVectorizer and excluding common English stop words
cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(personComments_corpus.comments)
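# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use it instead of get_feature_names_out()
personComments_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())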
personComments_dtm.index = personComments_corpus.index
personComments_dtm

Unnamed: 0,aaa,aaaa,aaaaaaaaaaaaa,aaaaaaakward,aaw,aawww,ab,abandon,abandoned,abbara,...,يا,ياسمين,يتدلى,يديه,يستوعبها,يشبه,يعارض,ᴛʜɪs,ᴛʜᴜᴍʙɴᴀɪʟ,ᴡᴛғ
Danielle Cohn,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,1
Emily Ann Shaheen,0,0,0,0,0,1,0,0,0,2,...,1,1,1,1,1,1,1,0,0,0
Madeline & Eric,1,1,1,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
Madison Beer,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Rebecca Black,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Richard Gale,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Shannon Beveridge,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Steven Assanti,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [347]:
#Pickling document-term matrix and fitted vectorizer for later use
personComments_dtm.to_pickle("personComments_dtm.pkl")
with open("cv.pkl", "wb") as f:
    pickle.dump(cv, f)