# Text Data Cleaning

**This page will show you how I cleaned up the text data that I gathered from Twitter API. The cleaning of text data is very different from that of record data. The goal of text data cleanup is to make raw text standardized and uniform in format for later analysis. The text data to be cleaned on this page is gather from twitter api. So each unit in the data is a tweet posted by the user.**


You can access the raw json file [here](https://github.com/anly501/anly-501-project-liumingqian0511/tree/main/data/00-raw-data/twitter_data). 

In [133]:
import nltk;
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import string 
import json
from nltk.sentiment import SentimentIntensityAnalyzer
import json
import pandas as pd
from nltk.corpus import wordnet as wn
import os, json
import glob


### Read in Json File
================================================================================


**The first step is to read in the json file that needs to be cleaned and open it with the pd.read_json() function. One good thing about the pd.read_json() function is that every json file opened with it automatically converts to a dataframe which is more operable. I assigned the file to 'health_insurance_df'. I created an 'ID' column that take value from 1 to the length of the data frame for later use. After adding 'ID' column, we can see that the data frame consists of 6 column, we are going to focus on the 'text' column for cleaning purpose.**

In [184]:
df1 = pd.read_json('/Users/liumingqian/anly-501-project-liumingqian0511/data/00-raw-data/twitter_data/medicaid.json')
df2 = pd.read_json('/Users/liumingqian/anly-501-project-liumingqian0511/data/00-raw-data/twitter_data/health_insurance.json')
df3 = pd.read_json('/Users/liumingqian/anly-501-project-liumingqian0511/data/00-raw-data/twitter_data/health_care.json')
df4 = pd.concat([df1,df2],ignore_index=True)
health_insurance_df = pd.concat([df4,df3],ignore_index=True)
#health_insurance_df.drop_duplicates(subset='text', inplace=True)
health_insurance_df = health_insurance_df.assign(ID = list(range(1,len(health_insurance_df)+1)))
health_insurance_df.shape


(897, 6)

### Filter Text
================================================================================

**The first step in text cleaning is to remove the stop words from the text. Stopwords are the words which does not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence. For example, the words like the, I, it etc. Such words are already captured this in corpus named corpus. We first download it to our python environment. I wrote a define function to loop through each tweets and to filter out stopwords and lowercase all the letters. Applying this function to our dataframe, we can see that the 'text' column is now stopwords-free.**

In [175]:
def filterStopwords(df):
    for val, tweet in enumerate(df['text']):
        new_text=""
        for word in nltk.tokenize.word_tokenize(tweet):
            if word not in nltk.corpus.stopwords.words('english'):
                if word in [".",",","!","?",":",";"]:
                    #remove the last space
                    new_text = new_text[0:-1]+word+" "
                else: 
                    #add a space
                    new_text+=word.lower()+" "
        df['text'][val] = new_text

In [185]:
filterStopwords(health_insurance_df)
health_insurance_df.head(10)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['text'][val] = new_text


Unnamed: 0,id,created_at,author_id,lang,text,ID
0,1574935190694428672,2022-09-28 01:33:04+00:00,139758483,en,rt @ trumpstaxes: south dakota issue: medicaid...,1
1,1574935188223959040,2022-09-28 01:33:04+00:00,1066108439787642880,en,rt @ reverendwarnock: herschel walker said lou...,2
2,1574935177783934976,2022-09-28 01:33:01+00:00,386543721,en,rt @ staceyabrams: brian kemp far-right extrem...,3
3,1574935134783950848,2022-09-28 01:32:51+00:00,1395715787009236992,en,rt @ danaparish: biogen pays $ 900m settle doc...,4
4,1574935134369091584,2022-09-28 01:32:51+00:00,3442600336,en,rt @ staceyabrams: brian kemp far-right extrem...,5
5,1574935077632765952,2022-09-28 01:32:38+00:00,1572043481551446016,en,rt @ staceyabrams: brian kemp far-right extrem...,6
6,1574935007293894656,2022-09-28 01:32:21+00:00,1316748480149368832,en,rt @ safetynetplans: the bipartisan safer comm...,7
7,1574934930198392832,2022-09-28 01:32:02+00:00,87088452,en,rt @ staceyabrams: brian kemp far-right extrem...,8
8,1574934904705404928,2022-09-28 01:31:56+00:00,1159517493578235904,en,rt @ staceyabrams: brian kemp far-right extrem...,9
9,1574934833171927040,2022-09-28 01:31:39+00:00,1113450409056645120,en,rt @ trumpstaxes: south dakota issue: medicaid...,10


### Lemmatize the data

In [163]:
lemmatizer = WordNetLemmatizer()
def lemmatize(df):
    for i,tweet in enumerate(df['text']):
        new_text = ""
        for word in tweet.split(' '):
            if wn.synsets(word):
                new_text += lemmatizer.lemmatize(word,pos= wn.synsets(word)[0].pos()) + " "
            else:
                new_text += word + " "
        df['text'][i] = new_text

In [164]:
lemmatize(health_insurance_df)
health_insurance_df.head()
health_insurance_df.shape

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['text'][i] = new_text
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['text'][i] = new_text
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['text'][i] = new_text
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['text'][i] = new_text
A value is trying to be set on a copy of a slice from a DataFrame

S

(154, 6)

### Sentiment Analysis
================================================================================

**The second step of text data cleaning is to perform sentiment analysis for each tweet and output their scores. Sentiment analysis is a technique that detects the underlying sentiment in a piece of text. It is the process of classifying text as either positive, negative, or neutral. Sentiment analysis is very essential to gauge customers or users response. In the following chunks, I wrote a getSentiments() function to rate each tweets' sentiment scores in positivity, negativity, and neutrality. I converted the result from the dictionary to a data frame 'score', and also added a column 'ID' that takes the same value as the 'ID' column in the health_insurance_df. Displaying the first ten rows of the score data frame, we can see that we have four columns of values to rate the corresponting sentiment and one column of 'ID' for later use.**

In [186]:
def getSentiments(df):
    sia = SentimentIntensityAnalyzer()
    tweet_str = ""
    tweetscore = []
    for tweet in df['text']:
        tweet_str = tweet_str + " " + tweet
        score = sia.polarity_scores(tweet_str)
        tweetscore.append(score)
    return tweetscore

sentiment = getSentiments(health_insurance_df)


In [187]:

score = pd.DataFrame.from_dict(sentiment)
score = score.assign(ID = list(range(1,len(score)+1)))
score.head(10)


Unnamed: 0,neg,neu,pos,compound,ID
0,0.0,0.856,0.144,0.4019,1
1,0.0,0.836,0.164,0.7096,2
2,0.094,0.799,0.107,0.1779,3
3,0.076,0.837,0.087,0.1779,4
4,0.119,0.813,0.068,-0.5423,5
5,0.146,0.798,0.055,-0.8271,6
6,0.122,0.792,0.086,-0.5574,7
7,0.142,0.783,0.075,-0.8316,8
8,0.157,0.777,0.066,-0.9186,9
9,0.141,0.785,0.074,-0.8834,10


**Now we inner join the score data with the health_insurance_df data so that each tweet in the health_insurance dataframe will have coresponding sentiment scores.**

In [188]:
health_insurance_df = health_insurance_df.merge(score,how='inner',on='ID')
health_insurance_df['label'] =health_insurance_df[['neg','pos']].idxmax(axis=1)
health_insurance_df['label'].value_counts()
#health_insurance_df.head()


pos    836
neg     61
Name: label, dtype: int64

In [191]:
health_insurance_df.drop_duplicates(subset = 'text', inplace=True)
health_insurance_df.shape


(154, 11)

In [192]:
health_insurance_df['label'].value_counts()

pos    129
neg     25
Name: label, dtype: int64

### Tidy Dataframe
================================================================================

**By the previous step, the basic Text cleaning was done. Now we're going to finish up our data frame by renaming columns to more intuitive names, casting the data type, and adding a column to dispplay the label of of the tweet sentiment.**

In [None]:
health_insurance_df['created_at'] = health_insurance_df['created_at'].apply(lambda x: x.date)
health_insurance_df.rename(columns={'created_at':'date','lang':'language'},inplace = True)
health_insurance_df.drop(columns = ['author_id','id'],inplace = True)
health_insurance_df.head(10)

Unnamed: 0,date,language,text,ID,neg,neu,pos,compound,label
0,2022-09-28,en,trumpstaxes south dakota issue medicaid expans...,1,0.0,0.838,0.162,0.4019,neu
1,2022-09-28,en,reverendwarnock herschel walker say loud oppos...,2,0.0,0.753,0.247,0.802,neu
2,2022-09-28,en,staceyabrams brian kemp far-right extremist da...,3,0.103,0.734,0.163,0.4588,neu
3,2022-09-28,en,danaparish biogen pay 900m settle doctor kickb...,4,0.105,0.764,0.131,0.3818,neu
4,2022-09-28,en,crimewatchmpls update agellison totally seriou...,7,0.155,0.727,0.118,-0.3384,neu
5,2022-09-28,en,rfcgeneric wow surprise big pharma pay massive...,14,0.132,0.677,0.191,0.9442,neu
6,2022-09-28,en,thetnholler today holler plaza push back vandy...,15,0.139,0.661,0.2,0.9517,neu
7,2022-09-28,en,jbsgreenberg katie_thomas hi jessica katie - f...,16,0.138,0.67,0.192,0.9486,neu
8,2022-09-28,en,actiondemocrat potus majority vote promise exp...,19,0.129,0.68,0.191,0.9709,neu
9,2022-09-28,en,luvurneighbor68 nema yall put print ... rumor ...,20,0.116,0.705,0.179,0.9769,neu


### Vectorizing Text Data
================================================================================

**In programming, a vector is a data structure that is similar to a list or an array. For the purpose of input representation, it is simply a succession of values, with the number of values representing the vector’s “dimensionality.” Text Vectorization is the process of converting text into numerical representation. I extracted the text from each tweet and save them both to a string for wordcloud and to a list for vectorizing. Using the CountVectorizer() function from the sklearn library, we can convert the corpus to a dense matrix. I transformed the matrix to a data frame which each column take a word. This gives us a 300 x 1026 huge data frame.**

In [None]:
corpus_str = ""
corpus = []
health_insurance_df['text'].apply(lambda x: corpus.append(x))
corpus_str = corpus_str.join(corpus)

In [None]:
vectorizer=CountVectorizer()
Xs  =  vectorizer.fit_transform(corpus)
X = np.array(Xs.todense())
col_names=vectorizer.get_feature_names_out()
vec = pd.DataFrame(X,columns=col_names)
vec['label'] = health_insurance_df['label']
vec.to_csv('/Users/liumingqian/anly-501-project-liumingqian0511/data/01-modified-data/vec.csv')
vec.head()

Unnamed: 0,10,100,100m,11,15,1828,1wtckjj8iz,2020,30,43,...,yall,year,years,yes,you,young,your,zirui,zmynknjeur,label
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,neu
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,neu
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,neu
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,neu
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,neu


**Keep working the vectorized data frame, I summed up the value for each column and sortted them in descending order. By doing this, we are able to get the word frequency in a more intuitional way.**

In [None]:
sum_words = Xs.sum(axis=0) 
words_freq = [[word, sum_words[0, idx]] for word, idx in vectorizer.vocabulary_.items()]
words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
words_freq_df = pd.DataFrame(words_freq,columns=['word','Frequency'])
words_freq_df.head(10)


Unnamed: 0,word,Frequency
0,health,30
1,insurance,21
2,care,13
3,and,13
4,the,13
5,in,12
6,medicaid,11
7,rt,11
8,pay,10
9,to,10
