# NATURAL LANGUAGE PROCESSING (Basics):

## Contents:
- Data Description
- Data Statistics
- Text Cleaning
- NLP Pipeline



## import basic libraries of Regular Python

In [1]:
import pandas as pd
import numpy as np
import re
import time

## import dataset 

[Data Source Kaggle](https://www.kaggle.com/competitions/sentiment-analysis-on-movie-reviews/data)

In [2]:
df = pd.read_csv("train.tsv",sep='\t')
df.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
1,2,1,A series of escapades demonstrating the adage ...,2
2,3,1,A series,2
3,4,1,A,2
4,5,1,series,2


### TSV vs CSV
```Tab Separated Values (TSV) and Comma Separated Values (CSV) are the two file type extensions used to load a set of data. A .tsv file will have tab separated values whereas .csv file has comma separated fields. These types of files are used to get raw data for many purposes possible. Most of the Text Editors can read both the type of files. TSV extension files are more efficient for many applications in programming languages like javascript, tensorflow etc. Also, in a TSV file rows and columns are easily distinguishable. CSV files are used more in fields like machine learning, data analysis, deep learning.```

## Data Statistics:

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156060 entries, 0 to 156059
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   PhraseId    156060 non-null  int64 
 1   SentenceId  156060 non-null  int64 
 2   Phrase      156060 non-null  object
 3   Sentiment   156060 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 4.8+ MB


In [4]:
df.describe()

Unnamed: 0,PhraseId,SentenceId,Sentiment
count,156060.0,156060.0,156060.0
mean,78030.5,4079.732744,2.063578
std,45050.785842,2502.764394,0.893832
min,1.0,1.0,0.0
25%,39015.75,1861.75,2.0
50%,78030.5,4017.0,2.0
75%,117045.25,6244.0,3.0
max,156060.0,8544.0,4.0


In [5]:
df.columns

Index(['PhraseId', 'SentenceId', 'Phrase', 'Sentiment'], dtype='object')

In [6]:
non_null = df['Phrase'].notna().count()
total = df['Phrase'].count()
print("Total counts of  values in Phrase column is ",total)
print("Total counts of not null values in Phrase column is ",non_null)

Total counts of  values in Phrase column is  156060
Total counts of not null values in Phrase column is  156060


#### comment
there is no any null cell in Phrase column

## Data Description:

This tsv file contains the phrases and their associated sentiment labels.<br>
The sentiment labels are:<br>

- 0 - negative
- 1 - somewhat negative
- 2 - neutral
- 3 - somewhat positive
- 4 - positive

here we will pass only data of sentiment label 0 and 4 from NLP pipeline<br>

now we will create a separate dataframe that contains only phrases with sentiment labels 0 and 4

In [7]:
df2 = pd.concat(
[
    df[df.Sentiment==0],
    df[df.Sentiment==4]
])[["Phrase","Sentiment"]]
print(len(df2))

16278


In [8]:
df2.head(100)

Unnamed: 0,Phrase,Sentiment
101,would have a hard time sitting through this one,0
103,have a hard time sitting through this one,0
157,Aggressive self-glorification and a manipulati...,0
159,self-glorification and a manipulative whitewash,0
201,Trouble Every Day is a plodding mess .,0
...,...,...
3010,ugly to look at and not a Hollywood product,0
3030,bogged down in earnest dramaturgy,0
3172,cheap,0
3266,"Violent , vulgar and forgettably",0


## Import packages for NLP:

In [9]:
import unidecode
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
#pip install autocorrect
from autocorrect import Speller
import string
import timeit
from bs4 import BeautifulSoup

[nltk_data] Downloading package punkt to C:\Users\Abdul
[nltk_data]     Sami\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Abdul Sami\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to C:\Users\Abdul
[nltk_data]     Sami\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\Users\Abdul
[nltk_data]     Sami\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Abdul
[nltk_data]     Sami\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Little Description of Each imported Package:


- __unidecode:__ <br>
The unidecode module accepts unicode string values and returns a unicode string in Python 3. You are giving it binary data instead. Decode to unicode or open the input text file in textmode, and encode the result to ASCII before writing it to a file, or open the output text file in text mode.<br>
- __nltk.corpus:__ <br>
NLTK corpus readers. The modules in this package provide functions that can be used to read corpus files in a variety of formats.<br>
- __word_tokenize:__ <br>
We use the word_tokenize() method to split a sentence into tokens or words<br>
- __WordNetLemmatizer__:<br>
Wordnet is an large, freely and publicly available lexical database for the English language aiming to establish structured semantic relationships between words.<br>



## Text Cleaning:

### Remove newlines & tabs

In [10]:
def remove_newlines_tabs(text):
    # Replacing all the occurrences of \n,\\n,\t,\\ with a space.
    Formatted_text = text.replace('\\n', ' ').replace('\n', ' ').replace('\t',' ').replace('\\', ' ').replace('. com', '.com')
    return Formatted_text
print("input: ", "This is her \\ first day at this place.\n Please,\t Be nice to her.\\n")
print("Output: "),remove_newlines_tabs("This is her \\ first day at this place.\n Please,\t Be nice to her.\\n")

input:  This is her \ first day at this place.
 Please,	 Be nice to her.\n
Output: 


(None, 'This is her   first day at this place.  Please,  Be nice to her. ')

### Remove Links

In [11]:
def remove_links(text):
    # Removing all the occurrences of links that starts with https
    remove_https = re.sub(r'http\S+', '', text)
    # Remove all the occurrences of text that ends with .com
    remove_com = re.sub(r"\ [A-Za-z]*\.com", " ", remove_https)
    return remove_com

remove_links(" website: catster.com  visit: https://catster.com//how-to-feed-cats adasfas ad as")


' website:   visit:  adasfas ad as'

### Remove WhiteSpaces

In [12]:
def remove_whitespace(text):
    pattern = re.compile(r'\s+') 
    Without_whitespace = re.sub(pattern, ' ', text)
    # There are some instances where there is no space after '?' & ')', 
    # So I am replacing these with one space so that It will not consider two words as one token.
    text = Without_whitespace.replace('?', ' ? ').replace(')', ') ')
    return text    

remove_whitespace("How   are  \t you \n   doing? (pakistan)ABC")


'How are you doing ?  (pakistan) ABC'

### Remove Accented Characters

In [13]:
def accented_characters_removal(text):
    # Remove accented characters from text using unidecode.
    # Unidecode() - It takes unicode data & tries to represent it to ASCII characters. 
    text = unidecode.unidecode(text)
    return text

accented_characters_removal("Málaga, àéêöhello")

'Malaga, aeeohello'

### Case Conversion

In [14]:
def lower_casing_text(text):
    # Convert text to lower case
    # lower() - It converts all upperase letter of given string to lowercase.
    text = text.lower()
    return text

lower_casing_text("Pakistan Zinda BAD!")

'pakistan zinda bad!'

### Reduce repeated characters and punctuations

In [15]:
def reducing_incorrect_character_repeatation(text):
    # Pattern matching for all case alphabets
    Pattern_alpha = re.compile(r"([A-Za-z])\1{1,}", re.DOTALL)
    
    # Limiting all the  repeatation to two characters.
    Formatted_text = Pattern_alpha.sub(r"\1\1", text) 
    
    # Pattern matching for all the punctuations that can occur
    Pattern_Punct = re.compile(r'([.,/#!$%^&*?;:{}=_`~()+-])\1{1,}')
    
    # Limiting punctuations in previously formatted string to only one.
    Combined_Formatted = Pattern_Punct.sub(r'\1', Formatted_text)
    
    # The below statement is replacing repeatation of spaces that occur more than two times with that of one occurrence.
    Final_Formatted = re.sub(' {2,}',' ', Combined_Formatted)
    return Final_Formatted

reducing_incorrect_character_repeatation("Realllllllllyyyyy,        Greeeeaaaatttt   !!!!?....;;;;:)")

'Reallyy, Greeaatt !?.;:)'

### Expand contraction words

In [16]:
CONTRACTION_MAP = {
"ain't": "is not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have",
}
# The code for expanding contraction words
def expand_contractions(text, contraction_mapping =  CONTRACTION_MAP):
    # Tokenizing text into tokens.
    list_Of_tokens = text.split(' ')

    # Checking for whether the given token matches with the Key & replacing word with key's value.
    
    # Check whether Word is in lidt_Of_tokens or not.
    for Word in list_Of_tokens: 
        # Check whether found word is in dictionary "Contraction Map" or not as a key. 
         if Word in CONTRACTION_MAP: 
                # If Word is present in both dictionary & list_Of_tokens, replace that word with the key value.
                list_Of_tokens = [item.replace(Word, CONTRACTION_MAP[Word]) for item in list_Of_tokens]
                
    # Converting list of tokens to String.
    String_Of_tokens = ' '.join(str(e) for e in list_Of_tokens) 
    return String_Of_tokens    

expand_contractions("ain't , aren't , can't , can't've")


"is not , are not , cannot , cannot've"

### Remove special characters

In [17]:
def removing_special_characters(text):
    # The formatted text after removing not necessary punctuations.
    Formatted_Text = re.sub(r"[^a-zA-Z0-9:$-,%.?!]+", ' ', text) 
    # In the above regex expression,I am providing necessary set of punctuations that are frequent in this particular dataset.
    return Formatted_Text

removing_special_characters(" Hello, K-a-j-a-l. Thi*s is $100.05 : the payment that you will recieve! (Is this okay?) ")

' Hello, K a j a l. Thi*s is $100.05 : the payment that you will recieve! (Is this okay?) '

## NLP pipeline:

In [18]:
#Example:
# Variable that stores the whole paragraph
text = "A phrase is made up of a head (or headword)—which determines the grammatical nature of the unit—and one or more optional modifiers. Phrases may contain other phrases inside them."

# Tokenize paragraph into sentences
sentences = nltk.sent_tokenize(text)

# Print out sentences
for sentence in sentences:
	print(sentence)

A phrase is made up of a head (or headword)—which determines the grammatical nature of the unit—and one or more optional modifiers.
Phrases may contain other phrases inside them.


### Word Tokenization:

In [19]:
from nltk.tokenize import word_tokenize 

def token(text):
    tokens = word_tokenize(text)
    return tokens
    
token("A phrase is made up of a head (or headword)—which determines the grammatical nature of the unit—and one or more optional modifiers. Phrases may contain other phrases inside them.")

['A',
 'phrase',
 'is',
 'made',
 'up',
 'of',
 'a',
 'head',
 '(',
 'or',
 'headword',
 ')',
 '—which',
 'determines',
 'the',
 'grammatical',
 'nature',
 'of',
 'the',
 'unit—and',
 'one',
 'or',
 'more',
 'optional',
 'modifiers',
 '.',
 'Phrases',
 'may',
 'contain',
 'other',
 'phrases',
 'inside',
 'them',
 '.']

### POS Tagging:

In [20]:
from nltk import pos_tag 
def partsofspeech(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    return pos_tags



### Text Lemmatization:

In [21]:
def lemmat(text):
    tokens = word_tokenize(text) # tokens
    lemmatizer = nltk.stem.WordNetLemmatizer() # lemmatizer
    lemma = [lemmatizer.lemmatize(w) for w in tokens]
    tokens_without_sw = [word for word in lemma if not word in stopwords.words('english')]
    return tokens_without_sw


### Stopwords

In [22]:
def stop(text):
    tokens_without_sw = [word for word in text if not word in stopwords.words('english')]
    return tokens_without_sw


## Putting all in single function

In [23]:
def text_preprocessing(text, newlines_tabs=True,links=True,extra_whitespace=True,accented_chars=True,lowercase=True, contractions=True, punctuations=True,special_chars=True,tokenization=False,lemmatization_allow = True,stop_words=False):
    
    if newlines_tabs==True:
        data = remove_newlines_tabs(text)
    if links == True:
        data = remove_links(text)
    if extra_whitespace==True:
        data = remove_whitespace(text)
    if accented_chars==True:
        data = accented_characters_removal(text)
    if lowercase==True:
        data = lower_casing_text(text)
    if contractions==True:
        data = expand_contractions(text)
    if punctuations==True:
        data = reducing_incorrect_character_repeatation(text)
    if special_chars==True:
        data = removing_special_characters(text)
    if lemmatization_allow==True:
        data = lemmat(text)
    return data    
        
    

### Now apply function on dataset

In [24]:
df2.tail(100)

Unnamed: 0,Phrase,Sentiment
154379,wonderfully speculative,4
154423,Nonchalantly freaky and uncommonly pleasurable...,4
154429,", Warm Water may well be the year 's best and ...",4
154430,Warm Water may well be the year 's best and mo...,4
154432,may well be the year 's best and most unpredic...,4
...,...,...
155946,is laughingly enjoyable,4
155955,a unique culture that is presented with univer...,4
155961,with universal appeal,4
156007,really do a great job of anchoring the charact...,4


In [25]:

%%time
df2['Processed_Phrase']= df2.Phrase.apply(text_preprocessing)


Wall time: 1min 11s


In [26]:
df2.tail(100)

Unnamed: 0,Phrase,Sentiment,Processed_Phrase
154379,wonderfully speculative,4,"[wonderfully, speculative]"
154423,Nonchalantly freaky and uncommonly pleasurable...,4,"[Nonchalantly, freaky, uncommonly, pleasurable..."
154429,", Warm Water may well be the year 's best and ...",4,"[,, Warm, Water, may, well, year, 's, best, un..."
154430,Warm Water may well be the year 's best and mo...,4,"[Warm, Water, may, well, year, 's, best, unpre..."
154432,may well be the year 's best and most unpredic...,4,"[may, well, year, 's, best, unpredictable, com..."
...,...,...,...
155946,is laughingly enjoyable,4,"[laughingly, enjoyable]"
155955,a unique culture that is presented with univer...,4,"[unique, culture, presented, universal, appeal]"
155961,with universal appeal,4,"[universal, appeal]"
156007,really do a great job of anchoring the charact...,4,"[really, great, job, anchoring, character, emo..."


In [27]:
df2.head(102)

Unnamed: 0,Phrase,Sentiment,Processed_Phrase
101,would have a hard time sitting through this one,0,"[would, hard, time, sitting, one]"
103,have a hard time sitting through this one,0,"[hard, time, sitting, one]"
157,Aggressive self-glorification and a manipulati...,0,"[Aggressive, self-glorification, manipulative,..."
159,self-glorification and a manipulative whitewash,0,"[self-glorification, manipulative, whitewash]"
201,Trouble Every Day is a plodding mess .,0,"[Trouble, Every, Day, plodding, mess, .]"
...,...,...,...
3172,cheap,0,[cheap]
3266,"Violent , vulgar and forgettably",0,"[Violent, ,, vulgar, forgettably]"
3270,vulgar,0,[vulgar]
3475,at the expense of those who paid for it and th...,0,"[expense, paid, pay, see]"


In [28]:
df2 = df2.reset_index()
df2['Processed_Phrase'][0]

['would', 'hard', 'time', 'sitting', 'one']

In [29]:
df2['Processed_Phrase1'] = df2.Processed_Phrase.apply(lambda x:" ".join(x))
df2

Unnamed: 0,index,Phrase,Sentiment,Processed_Phrase,Processed_Phrase1
0,101,would have a hard time sitting through this one,0,"[would, hard, time, sitting, one]",would hard time sitting one
1,103,have a hard time sitting through this one,0,"[hard, time, sitting, one]",hard time sitting one
2,157,Aggressive self-glorification and a manipulati...,0,"[Aggressive, self-glorification, manipulative,...",Aggressive self-glorification manipulative whi...
3,159,self-glorification and a manipulative whitewash,0,"[self-glorification, manipulative, whitewash]",self-glorification manipulative whitewash
4,201,Trouble Every Day is a plodding mess .,0,"[Trouble, Every, Day, plodding, mess, .]",Trouble Every Day plodding mess .
...,...,...,...,...,...
16273,155946,is laughingly enjoyable,4,"[laughingly, enjoyable]",laughingly enjoyable
16274,155955,a unique culture that is presented with univer...,4,"[unique, culture, presented, universal, appeal]",unique culture presented universal appeal
16275,155961,with universal appeal,4,"[universal, appeal]",universal appeal
16276,156007,really do a great job of anchoring the charact...,4,"[really, great, job, anchoring, character, emo...",really great job anchoring character emotional...


## Word Embedding or Text Vectorization

There are various techniques used to do word embedding:
- One Hot Encoding
- Count Vectorizer
- Bag of Words
- N-grams Vectorization
- TF-IDF Vectorization

### Using TF-IDF Vectorization:

In [30]:
%%time
from sklearn.feature_extraction.text import TfidfVectorizer
vocabulary_ = []
idf_ = []
vectors = []
# create the transform
vectorizer = TfidfVectorizer()

for i in df2['Processed_Phrase'].index:
    vectorizer.fit(df2['Processed_Phrase'][i])
    vocabulary_.append(vectorizer.vocabulary_)
    idf_.append(vectorizer.idf_)
    vector = vectorizer.transform(df2['Processed_Phrase'][i])
    vectors.append(vector.toarray())

df2.insert(4,"Vocabularry",vocabulary_,True)
df2.insert(5,"IDF",idf_,True)


Wall time: 33.8 s


In [31]:
df2.insert(6,"Vectors",vectors,True)

In [32]:
df2.tail(100)

Unnamed: 0,index,Phrase,Sentiment,Processed_Phrase,Vocabularry,IDF,Vectors,Processed_Phrase1
16178,154379,wonderfully speculative,4,"[wonderfully, speculative]","{'wonderfully': 1, 'speculative': 0}","[1.4054651081081644, 1.4054651081081644]","[[0.0, 1.0], [1.0, 0.0]]",wonderfully speculative
16179,154423,Nonchalantly freaky and uncommonly pleasurable...,4,"[Nonchalantly, freaky, uncommonly, pleasurable...","{'nonchalantly': 4, 'freaky': 2, 'uncommonly':...","[3.0794415416798357, 3.0794415416798357, 3.079...","[[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0,...","Nonchalantly freaky uncommonly pleasurable , W..."
16180,154429,", Warm Water may well be the year 's best and ...",4,"[,, Warm, Water, may, well, year, 's, best, un...","{'warm': 4, 'water': 5, 'may': 2, 'well': 6, '...","[2.791759469228055, 2.791759469228055, 2.79175...","[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0....",", Warm Water may well year 's best unpredictab..."
16181,154430,Warm Water may well be the year 's best and mo...,4,"[Warm, Water, may, well, year, 's, best, unpre...","{'warm': 4, 'water': 5, 'may': 2, 'well': 6, '...","[2.7047480922384253, 2.7047480922384253, 2.704...","[[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0], [0....",Warm Water may well year 's best unpredictable...
16182,154432,may well be the year 's best and most unpredic...,4,"[may, well, year, 's, best, unpredictable, com...","{'may': 2, 'well': 4, 'year': 5, 'best': 0, 'u...","[2.386294361119891, 2.386294361119891, 2.38629...","[[0.0, 0.0, 1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0....",may well year 's best unpredictable comedy
...,...,...,...,...,...,...,...,...
16273,155946,is laughingly enjoyable,4,"[laughingly, enjoyable]","{'laughingly': 1, 'enjoyable': 0}","[1.4054651081081644, 1.4054651081081644]","[[0.0, 1.0], [1.0, 0.0]]",laughingly enjoyable
16274,155955,a unique culture that is presented with univer...,4,"[unique, culture, presented, universal, appeal]","{'unique': 3, 'culture': 1, 'presented': 2, 'u...","[2.09861228866811, 2.09861228866811, 2.0986122...","[[0.0, 0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0....",unique culture presented universal appeal
16275,155961,with universal appeal,4,"[universal, appeal]","{'universal': 1, 'appeal': 0}","[1.4054651081081644, 1.4054651081081644]","[[0.0, 1.0], [1.0, 0.0]]",universal appeal
16276,156007,really do a great job of anchoring the charact...,4,"[really, great, job, anchoring, character, emo...","{'really': 8, 'great': 4, 'job': 5, 'anchoring...","[2.7047480922384253, 2.7047480922384253, 2.704...","[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]...",really great job anchoring character emotional...


In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = df2.Processed_Phrase1.values
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)

# encode document
# vector = vectorizer.transform(text)



### Now change 4 to 1 (Positive)
### 0 to 0 (Negative)

In [34]:
df2.Sentiment = df2.Sentiment.map({4:1,
                                  0:0})
df2.Sentiment.value_counts(dropna=False)

1    9206
0    7072
Name: Sentiment, dtype: int64

### save to csv format

In [35]:
Cleaned_Data = df2.to_csv('Cleaned_Data.csv', index = False)

### save to pickel format

In [36]:
df2.to_pickle("Cleaned_Data.pkl")