# Text Data Cleaning

**This page shows how I cleaned the text data that I gathered from the Twitter API. Cleaning text data is quite different from cleaning record data: the goal is to bring raw text into a standardized, uniform format for later analysis. Each unit in the data is a tweet posted by a user.**


You can access the raw JSON file [here](https://github.com/anly501/anly-501-project-liumingqian0511/tree/main/data/00-raw-data/twitter_data).

In [169]:
import json
import string

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords, wordnet
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# One-time downloads of the NLTK data used below
# (newer NLTK releases may also need 'punkt_tab' for word_tokenize)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('vader_lexicon')


### Read in Json File
================================================================================


**The first step is to read the JSON file to be cleaned with the pd.read_json() function, which returns the contents as a data frame that is easier to work with. I assigned the result to 'health_insurance_df' and created an 'ID' column that takes values from 1 to the length of the data frame for later use. After adding the 'ID' column, the data frame has 6 columns; we are going to focus on the 'text' column for cleaning.**

In [170]:
health_insurance_df = pd.read_json('/Users/liumingqian/anly-501-project-liumingqian0511/data/00-raw-data/twitter_data/tweetshealth insurance plan.json')
print(type(health_insurance_df))
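# Number the rows 1..n from the actual length instead of a hard-coded 300
health_insurance_df = health_insurance_df.assign(ID = list(range(1, len(health_insurance_df) + 1)))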
health_insurance_df.head(10)


<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,created_at,lang,author_id,id,text,ID
0,2022-09-28 01:16:11+00:00,en,1544004181886996480,1574930940291579904,@winter_canada I got 1 so far and have a diffe...,1
1,2022-09-28 01:04:48+00:00,en,950584668708945920,1574928074285584384,Did you know that you are covered by MediShiel...,2
2,2022-09-28 01:00:21+00:00,en,2986463136,1574926957220745216,Everyone deserves to have this kind of health ...,3
3,2022-09-28 00:30:00+00:00,en,1274040016704962560,1574919318693089280,Your total compensation is more than just your...,4
4,2022-09-28 00:29:57+00:00,en,80917722,1574919305912848384,Walmart is teaming up with a fertility startup...,5
5,2022-09-28 00:25:35+00:00,en,1558902826188906496,1574918206623547392,Best/Top 10 health insurance companies in Indi...,6
6,2022-09-28 00:16:02+00:00,en,1623244873,1574915800825200640,Enjoy your fall activities without worries kno...,7
7,2022-09-27 23:55:09+00:00,en,24280970,1574910545526157312,Looking to speak to someone who is aged betwee...,8
8,2022-09-27 23:54:17+00:00,en,890921916260896768,1574910329884409856,RT @Ampersand48: Subsidized housing and subsid...,9
9,2022-09-27 23:49:19+00:00,en,636585149,1574909077670469632,@glen_mcgregor You understand there's a bit of...,10
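
The same loading pattern can be tried on an in-memory sample, so it runs without the raw file. This is a minimal sketch: the two records and their field values are made up for illustration, and the 'ID' column is derived from the length of the frame.

```python
import io

import pandas as pd

# Hypothetical two-tweet sample standing in for the raw Twitter JSON file
raw = io.StringIO(
    '[{"lang": "en", "text": "Health insurance plans compared"},'
    ' {"lang": "en", "text": "Which plan covers dental?"}]'
)

sample_df = pd.read_json(raw)

# Number the rows 1..n using the actual length of the frame
sample_df = sample_df.assign(ID=list(range(1, len(sample_df) + 1)))

print(sample_df)
```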


### Filter Text
================================================================================

**The first step in text cleaning is to remove stop words from the text. Stop words are words that add little meaning to a sentence, such as "the", "I", and "it", and can usually be dropped without changing what the sentence says. NLTK already provides a list of them in its stopwords corpus, which first has to be downloaded to the Python environment. I wrote a function that loops through each tweet, filters out stop words, and lowercases the remaining words. After applying it to our data frame, most stop words are gone from the 'text' column; because the check runs before lowercasing, capitalized ones such as "Did" and "Your" remain.**

In [171]:
def filterStopwords(df):
    # Build the stopword set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    punctuation = {".", ",", "!", "?", ":", ";"}
    cleaned = []
    for tweet in df['text']:
        new_text = ""
        for word in word_tokenize(tweet):
            # The check is case-sensitive, so capitalized stopwords are kept
            if word not in stop_words:
                if word in punctuation:
                    # remove the last space so punctuation attaches to the previous word
                    new_text = new_text[0:-1] + word + " "
                else:
                    # lowercase the word and add a space
                    new_text += word.lower() + " "
        cleaned.append(new_text)
    # Assign the whole column at once to avoid chained-assignment warnings
    df['text'] = cleaned

In [172]:
filterStopwords(health_insurance_df)
health_insurance_df.head(10)


Unnamed: 0,created_at,lang,author_id,id,text,ID
0,2022-09-28 01:16:11+00:00,en,1544004181886996480,1574930940291579904,@ winter_canada i got 1 far different one sche...,1
1,2022-09-28 01:04:48+00:00,en,950584668708945920,1574928074285584384,did know covered medishield life singapore cit...,2
2,2022-09-28 01:00:21+00:00,en,2986463136,1574926957220745216,everyone deserves kind health insurance. we ’ ...,3
3,2022-09-28 00:30:00+00:00,en,1274040016704962560,1574919318693089280,your total compensation salary. when receive j...,4
4,2022-09-28 00:29:57+00:00,en,80917722,1574919305912848384,walmart teaming fertility startup offer benefi...,5
5,2022-09-28 00:25:35+00:00,en,1558902826188906496,1574918206623547392,best/top 10 health insurance companies india 2...,6
6,2022-09-28 00:16:02+00:00,en,1623244873,1574915800825200640,enjoy fall activities without worries knowing ...,7
7,2022-09-27 23:55:09+00:00,en,24280970,1574910545526157312,looking speak someone aged 25-31 years old sti...,8
8,2022-09-27 23:54:17+00:00,en,890921916260896768,1574910329884409856,rt @ ampersand48: subsidized housing subsidize...,9
9,2022-09-27 23:49:19+00:00,en,636585149,1574909077670469632,@ glen_mcgregor you understand 's bit differen...,10
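
The filtering idea can also be sketched without NLTK. This is a simplified illustration, not the notebook's function: the stopword set is a tiny hand-picked stand-in for NLTK's English list, tokenization is a plain whitespace split, and each word is lowercased before the stopword check so that capitalized stopwords are caught as well.

```python
# Tiny hand-picked stand-in for NLTK's English stopword list (assumption)
STOP_WORDS = {"the", "i", "it", "is", "a", "to", "you", "your", "did", "that"}


def remove_stopwords(text, stop_words=STOP_WORDS):
    """Lowercase the text and drop every token found in stop_words."""
    tokens = text.lower().split()  # naive whitespace tokenizer
    return " ".join(tok for tok in tokens if tok not in stop_words)


print(remove_stopwords("Did you know that the plan is covered"))
```

Lowercasing first is a design choice: it trades the original casing for a simpler, case-insensitive match against the stopword set.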


### Sentiment Analysis
================================================================================

**The second step is to run sentiment analysis on the tweets and record the scores. Sentiment analysis detects the underlying sentiment in a piece of text, classifying it as positive, negative, or neutral, which makes it useful for gauging customer or user response. In the following cells, the getSentiments() function produces negativity, neutrality, positivity, and compound scores with NLTK's SentimentIntensityAnalyzer. As written, it appends each tweet to a running string before scoring, so the scores in a given row describe all tweets up to and including that one rather than that tweet alone. I converted the resulting list of dictionaries to a data frame, 'score', and added an 'ID' column that takes the same values as the 'ID' column in health_insurance_df. The first ten rows of 'score' show four columns of sentiment values and one 'ID' column for later use.**

In [173]:
def getSentiments(df):
    sia = SentimentIntensityAnalyzer()
    tweet_str = ""
    tweetscore = []
    for tweet in df['text']:
        # tweet_str keeps growing, so each score covers every tweet seen so far
        tweet_str = tweet_str + " " + tweet
        score = sia.polarity_scores(tweet_str)
        tweetscore.append(score)
    return tweetscore

sentiment = getSentiments(health_insurance_df)


In [174]:

score = pd.DataFrame.from_dict(sentiment)
score = score.assign(ID = list(range(1, len(score) + 1)))
score.head(10)

Unnamed: 0,neg,neu,pos,compound,ID
0,0.048,0.952,0.0,-0.1027,1
1,0.028,0.972,0.0,-0.1027,2
2,0.02,0.932,0.048,0.4588,3
3,0.014,0.9,0.086,0.7845,4
4,0.012,0.85,0.138,0.926,5
5,0.01,0.84,0.149,0.9565,6
6,0.008,0.804,0.188,0.9834,7
7,0.007,0.823,0.17,0.9834,8
8,0.014,0.828,0.158,0.9828,9
9,0.012,0.819,0.17,0.9895,10
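
If a score per individual tweet is wanted instead, each text can be passed to the scorer on its own. The sketch below accepts any scoring callable; `score_each`, `toy_scorer`, and the word lists are invented for illustration and are not VADER. With NLTK installed, `SentimentIntensityAnalyzer().polarity_scores` could be passed in place of the toy scorer, since it also maps a string to a dictionary of scores.

```python
def score_each(texts, scorer):
    """Apply scorer to every text independently and collect the results."""
    return [scorer(text) for text in texts]


# Toy word lists, invented for this example (not the VADER lexicon)
POSITIVE = {"enjoy", "best", "help"}
NEGATIVE = {"worries", "denied"}


def toy_scorer(text):
    """Return the share of positive and negative words in one text."""
    words = text.lower().split()
    total = max(len(words), 1)  # avoid dividing by zero on empty text
    return {
        "pos": sum(w in POSITIVE for w in words) / total,
        "neg": sum(w in NEGATIVE for w in words) / total,
    }


scores = score_each(["enjoy the best plan", "claim denied again"], toy_scorer)
print(scores)
```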


**Now we inner join the score data with health_insurance_df on 'ID' so that each row carries its sentiment scores, and add a 'label' column naming whichever of 'neg', 'neu', and 'pos' is largest.**

In [175]:
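# Join on the shared 'ID' column explicitly
health_insurance_df = health_insurance_df.merge(score, on='ID', how='inner')
# The label is the name of the column holding the largest of the three scores
health_insurance_df['label'] = health_insurance_df[['neg','neu','pos']].idxmax(axis=1)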
health_insurance_df.head()


Unnamed: 0,created_at,lang,author_id,id,text,ID,neg,neu,pos,compound,label
0,2022-09-28 01:16:11+00:00,en,1544004181886996480,1574930940291579904,@ winter_canada i got 1 far different one sche...,1,0.048,0.952,0.0,-0.1027,neu
1,2022-09-28 01:04:48+00:00,en,950584668708945920,1574928074285584384,did know covered medishield life singapore cit...,2,0.028,0.972,0.0,-0.1027,neu
2,2022-09-28 01:00:21+00:00,en,2986463136,1574926957220745216,everyone deserves kind health insurance. we ’ ...,3,0.02,0.932,0.048,0.4588,neu
3,2022-09-28 00:30:00+00:00,en,1274040016704962560,1574919318693089280,your total compensation salary. when receive j...,4,0.014,0.9,0.086,0.7845,neu
4,2022-09-28 00:29:57+00:00,en,80917722,1574919305912848384,walmart teaming fertility startup offer benefi...,5,0.012,0.85,0.138,0.926,neu
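
The join and labelling steps are easier to follow on a tiny example. This is a minimal sketch with made-up tweets and scores; only the column names match the data above.

```python
import pandas as pd

# Made-up tweets and scores, keyed by a shared ID column
tweets = pd.DataFrame({"ID": [1, 2], "text": ["great plan", "claim denied"]})
scores = pd.DataFrame({
    "ID": [1, 2],
    "neg": [0.0, 0.6],
    "neu": [0.3, 0.4],
    "pos": [0.7, 0.0],
})

# Join on ID explicitly rather than relying on shared column names
merged = tweets.merge(scores, on="ID", how="inner")

# idxmax(axis=1) returns, per row, the column name with the largest value
merged["label"] = merged[["neg", "neu", "pos"]].idxmax(axis=1)

print(merged)
```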


### Tidy Dataframe
================================================================================

**With the previous step, the basic text cleaning is done. Now we finish the data frame by converting 'created_at' to a plain date, renaming columns to more intuitive names, and dropping the 'author_id' and 'id' columns. The 'label' column added in the previous cell shows the sentiment label of each tweet.**

In [176]:
# Use the .dt accessor to keep only the calendar date of each timestamp
health_insurance_df['created_at'] = health_insurance_df['created_at'].dt.date
health_insurance_df.rename(columns={'created_at':'date','lang':'language'},inplace = True)
health_insurance_df.drop(columns = ['author_id','id'],inplace = True)
health_insurance_df.head(10)

Unnamed: 0,date,language,text,ID,neg,neu,pos,compound,label
0,2022-09-28,en,@ winter_canada i got 1 far different one sche...,1,0.048,0.952,0.0,-0.1027,neu
1,2022-09-28,en,did know covered medishield life singapore cit...,2,0.028,0.972,0.0,-0.1027,neu
2,2022-09-28,en,everyone deserves kind health insurance. we ’ ...,3,0.02,0.932,0.048,0.4588,neu
3,2022-09-28,en,your total compensation salary. when receive j...,4,0.014,0.9,0.086,0.7845,neu
4,2022-09-28,en,walmart teaming fertility startup offer benefi...,5,0.012,0.85,0.138,0.926,neu
5,2022-09-28,en,best/top 10 health insurance companies india 2...,6,0.01,0.84,0.149,0.9565,neu
6,2022-09-28,en,enjoy fall activities without worries knowing ...,7,0.008,0.804,0.188,0.9834,neu
7,2022-09-27,en,looking speak someone aged 25-31 years old sti...,8,0.007,0.823,0.17,0.9834,neu
8,2022-09-27,en,rt @ ampersand48: subsidized housing subsidize...,9,0.014,0.828,0.158,0.9828,neu
9,2022-09-27,en,@ glen_mcgregor you understand 's bit differen...,10,0.012,0.819,0.17,0.9895,neu
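
A minimal sketch of the same tidy-up on a one-row frame. The values are made up, and only the column names are taken from the data above.

```python
import pandas as pd

# Made-up frame with the same column names as the tweet data
df = pd.DataFrame({
    "created_at": pd.to_datetime(["2022-09-28 01:16:11+00:00"]),
    "lang": ["en"],
    "author_id": [123],
    "id": [456],
    "text": ["sample tweet"],
})

# Keep only the calendar date, then rename and drop columns
df["created_at"] = df["created_at"].dt.date
df = df.rename(columns={"created_at": "date", "lang": "language"})
df = df.drop(columns=["author_id", "id"])

print(df)
```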


### Vectorizing Text Data
================================================================================

**In programming, a vector is a data structure similar to a list or an array. As an input representation, it is simply a sequence of values, and the number of values is the vector's "dimensionality". Text vectorization is the process of converting text into a numerical representation. I extracted the text of each tweet and saved it both to a single string for a word cloud and to a list for vectorizing. Using CountVectorizer from the sklearn library, we can convert the corpus to a sparse matrix of word counts, which I then expanded to a dense array and wrapped in a data frame in which each column corresponds to a word. This gives a large 300 x 1026 data frame.**

In [177]:
# List of cleaned tweets for vectorizing
corpus = health_insurance_df['text'].tolist()
# Single string for the word cloud; each cleaned tweet already ends with a space
corpus_str = "".join(corpus)

In [178]:
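vectorizer = CountVectorizer()
# fit_transform returns a sparse document-term matrix
Xs = vectorizer.fit_transform(corpus)
# Expand to a dense array so it can be wrapped in a data frame
X = Xs.toarray()
col_names = vectorizer.get_feature_names_out()
vec = pd.DataFrame(X, columns=col_names)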
vec.head()

Unnamed: 0,000,10,11,1fr33dom,20,200,2000s,2008,2020,2022,...,yet,ymooqbi3xa,yo52ognxt8,you,your,yzprsx9mt3,zackdunn314159,zekegary2,zero,znxpynpvhf
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
3,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
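
To make the idea of a count matrix concrete, here is a small standard-library sketch. It is not how CountVectorizer is implemented; `count_matrix` is a hypothetical helper that only mirrors two of its defaults, lowercasing and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

# Tokens of two or more word characters, as in CountVectorizer's default pattern
TOKEN = re.compile(r"\b\w\w+\b")


def count_matrix(docs):
    """Return (vocabulary, rows), with one row of term counts per document."""
    tokenized = [TOKEN.findall(doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


vocab, rows = count_matrix(["health plan plan", "a health insurance plan"])
print(vocab)
print(rows)
```

Each row has one entry per vocabulary term, which is why a corpus with many distinct words produces a wide and mostly zero matrix.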


**Continuing with the vectorized data, I summed the values in each column and sorted them in descending order. This gives the word frequencies in a more intuitive form.**

In [179]:
sum_words = Xs.sum(axis=0) 
words_freq = [[word, sum_words[0, idx]] for word, idx in vectorizer.vocabulary_.items()]
words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
words_freq_df = pd.DataFrame(words_freq,columns=['word','Frequency'])
words_freq_df.head(10)


Unnamed: 0,word,Frequency
0,plan,309
1,insurance,279
2,health,270
3,https,132
4,co,132
5,rt,87
6,benefits,63
7,the,54
8,help,51
9,care,45
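
The column-sum step can be sketched in plain Python as well. The vocabulary and counts below are made up, and `term_frequencies` is a hypothetical helper name.

```python
def term_frequencies(vocab, rows):
    """Sum each column of a count matrix and sort terms by total, descending."""
    totals = [sum(column) for column in zip(*rows)]
    return sorted(zip(vocab, totals), key=lambda pair: pair[1], reverse=True)


# Made-up vocabulary and per-document counts for illustration
vocab = ["health", "insurance", "plan"]
rows = [[1, 0, 2], [1, 1, 1]]

freqs = term_frequencies(vocab, rows)
print(freqs)
```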
