# MSDS 458: Research/Programming Assignment #4 (Final Assignment): Part 1

**Management Problem**

For this final research assignment, I apply deep learning methods covered in the course (MSDS 458) to conduct sentiment analysis of Twitter data, a challenging yet important field of study for organizations in both the public and private sectors. Twitter is a popular platform where entities at all levels (governments, businesses, country leaders, celebrities, and even the average social media user) express their opinions. The content of such tweets could represent a country's official policy or, collectively, a country's public sentiment toward a particular issue. Given the massive volume of tweets generated each day (on average, [6,000 tweets are posted on Twitter every second](https://www.internetlivestats.com/twitter-statistics/)), there is immense value in being able to quickly and accurately determine such sentiment values (e.g., positive or negative). I combine natural language processing (NLP) and deep learning techniques to build a robust Twitter sentiment classification model.

**Corpus Description**

The corpus I use is Stanford University's [Sentiment140](http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip). The dataset is a CSV file consisting of 1.6 million English-language tweets. The tweets are annotated using six attributes: 1. polarity of the tweet (0 = negative, 2 = neutral, 4 = positive); 2. ID; 3. date; 4. query; 5. username; 6. text.  For this project, I plan to use the polarity and text content to build a model that could take any given tweet and determine the most probable sentiment value.

**Methods**

***Text Preprocessing & Data Exploration***
<p>Given the massive number of documents in the corpus (one tweet = one document), it is important to ensure the data, particularly each tweet's text, is cleaned and tokenized properly for any follow-on modeling tasks. Regular expressions are useful in handling any emoticons or special characters (e.g., @ symbols and hashtags). As part of the data exploration step, I also generate visualizations using histograms and word clouds representing various aspects of the tweets (e.g., text content, null values, positive/negative/neutral breakdown, etc.) to gain a broad understanding of both qualitative and quantitative aspects of the corpus.</p>

***Text Vectorization***
<p>While the focus of this project is evaluating optimal deep neural network (DNN) architectures and parameters for Twitter sentiment analysis, such work depends on proper experimentation with, and implementation of, various text vectorization methods. I leverage the techniques I learned in MSDS 453 for this task, which include text vectorization approaches such as TF-IDF weighting and Doc2Vec embeddings. Since there are 1.6 million documents, it is important for me to limit the vocabulary by tuning the vectorizer hyperparameters (e.g., max_features, max_df) and exploring dimensionality reduction techniques, such as PCA.</p>
<p>I evaluate the vectors using various classifiers to ensure I have the optimal number of features in training for the classification models. I experiment with both traditional classifiers (e.g., logistic regression, random forest, and ensemble methods) and neural networks.</p>
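To make the vocabulary-limiting idea concrete, here is a minimal plain-Python sketch of roughly what `max_df` and `max_features` do when a vectorizer builds its vocabulary. It is an illustration under my own assumptions, not scikit-learn's implementation: the helper name `build_vocabulary` and the toy documents are hypothetical, and tokenization is a simple whitespace split.

```python
from collections import Counter

def build_vocabulary(docs, max_df=1.0, max_features=None):
    """Hypothetical helper: pick vocabulary terms from whitespace-tokenized docs."""
    n_docs = len(docs)
    term_freq = Counter()   # total occurrences of each term in the corpus
    doc_freq = Counter()    # number of documents containing each term
    for doc in docs:
        tokens = doc.split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    # Drop terms that appear in more than max_df (a proportion) of the documents
    kept = [t for t in term_freq if doc_freq[t] / n_docs <= max_df]
    # Rank the remaining terms by corpus frequency (ties broken alphabetically)
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

# Toy corpus, for illustration only
toy_docs = [
    "i love this phone",
    "i hate this weather",
    "i love love sunny weather",
    "this is fine",
]
vocab = build_vocabulary(toy_docs, max_df=0.7, max_features=3)
print(vocab)
```

Here "i" and "this" each appear in three of the four documents, so they exceed the 0.7 threshold and are dropped before the three most frequent remaining terms are kept.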

***Building Neural Network-Based Models***
<p>The main part of this research assignment entails careful experimentation with the various DNNs we covered in MSDS 458, including fully-connected dense networks, recurrent neural networks (RNN), long short-term memory networks (LSTM), and convolutional neural networks (CNN). The neural networks are built using Keras. The evaluation method to determine the best classification model consists of a strict training-and-test regimen using a crossed experimental design (e.g., keeping hyperparameter settings consistent across the architectures being compared). Vocabulary size and word embeddings also remain consistent. I experiment with various network structure designs, hyperparameter settings, and model fit methods by taking the input data (i.e., processed tweet text vectors) through single to multiple layers consisting of varying nodes/units across dense networks, RNNs, LSTM networks, and CNNs.</p>
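One way to lay out a crossed design is to enumerate every combination of the factors under study, so that each architecture is run under the same settings. The sketch below only builds the experiment grid. The factor names and values are illustrative assumptions, not the settings used in this assignment.

```python
from itertools import product

# Hypothetical factors; the levels are placeholders for illustration only
architectures = ["dense", "rnn", "lstm", "cnn"]
hidden_layers = [1, 2]
units = [64, 128]

# Settings held constant across every run so results stay comparable
fixed = {"vocab_size": 10000, "embedding_dim": 100}

# Every architecture is crossed with every layer count and every unit count
experiments = [
    {"architecture": arch, "layers": n_layers, "units": n_units, **fixed}
    for arch, n_layers, n_units in product(architectures, hidden_layers, units)
]

print(len(experiments))
print(experiments[0])
```

Each dictionary could then be passed to a model-building function, with training time and accuracy recorded per entry.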

**Evaluation**
<p>I track and compare training/testing times as well as accuracy and loss curves for train, validation, and test datasets to evaluate the deep learning models' performances. I also provide charts and plots summarizing these key metrics to visually capture which neural network structures and models yield the strongest performance, in terms of implementation time and accuracy.</p>

## Data Preparation

In [0]:
import pandas as pd  
import numpy as np
import matplotlib.pyplot as plt
import re
import os

from google.colab import files
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from bs4 import BeautifulSoup

import gensim
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.model_selection import train_test_split
from collections import namedtuple
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn import utils
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

from pprint import pprint
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
import os
os.chdir('/content/drive/My Drive/MSDS Files/MSDS 458/Assignment 4')

# Check working directory
!pwd

# Check files in directory
!ls

/content/drive/My Drive/MSDS Files/MSDS 458/Assignment 4
Data


## EDA

In [0]:
cols = ['sentiment','id','date','query','user','text']

In [0]:
train_data = '/content/drive/My Drive/MSDS Files/MSDS 458/Assignment 4/Data/trainingandtestdata/training.1600000.processed.noemoticon.csv'
df_train = pd.read_csv(train_data, encoding = 'ISO-8859-1', header=None, names=cols)

In [0]:
df_train.head()

| | sentiment | id | date | query | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


In [0]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   sentiment  1600000 non-null  int64 
 1   id         1600000 non-null  int64 
 2   date       1600000 non-null  object
 3   query      1600000 non-null  object
 4   user       1600000 non-null  object
 5   text       1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [0]:
df_train.loc[1500]

sentiment                               0
id                             1468168034
date         Tue Apr 07 00:05:12 PDT 2009
query                            NO_QUERY
user                        RopeMarksMuse
text             I am soooo tired @ work 
Name: 1500, dtype: object

Column Definitions (example values are taken from row 1500 above)

*  0: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive); 0 for this row
*  1: the id of the tweet (1468168034)
*  2: the date of the tweet (Tue Apr 07 00:05:12 PDT 2009)
*  3: the query. If there is no query, then this value is NO_QUERY (as it is for this row).
*  4: the user that tweeted (RopeMarksMuse)
*  5: the text of the tweet (I am soooo tired @ work)

In [0]:
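# Count each class by its label rather than by its position in value_counts(),
# whose ordering is not guaranteed when the counts are tied.
sentiment_pos_count = (df_train.sentiment == 4).sum()
sentiment_neg_count = (df_train.sentiment == 0).sum()
print(f'Positive Sentiment Count: {sentiment_pos_count}\nNegative Sentiment Count: {sentiment_neg_count}')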

Positive Sentiment Count: 800000
Negative Sentiment Count: 800000


Since there are no "neutral" sentiment values, I will change the positive sentiment polarity values from 4 to 1 for simplicity. The negative labels can remain 0.

In [0]:
df_train.sentiment = df_train.sentiment.replace(to_replace=4,value=1)

In [0]:
df_train.sentiment.value_counts()

1    800000
0    800000
Name: sentiment, dtype: int64
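The same check and relabeling can be sketched without pandas. The labels below are a made-up sample, not rows from Sentiment140. The point is that counts are looked up by label, and that an explicit mapping makes an unexpected label fail loudly instead of passing through unchanged.

```python
from collections import Counter

# Made-up polarity labels standing in for the 'sentiment' column
raw_labels = [0, 4, 4, 0, 4, 0]

counts = Counter(raw_labels)
print(f"negative (0): {counts[0]}, neutral (2): {counts[2]}, positive (4): {counts[4]}")

# Explicit mapping: a label outside {0, 4} raises KeyError instead of slipping through
label_map = {0: 0, 4: 1}
binary_labels = [label_map[label] for label in raw_labels]
print(binary_labels)
```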

Dropping unnecessary columns to simplify dataframe

In [0]:
df_train.drop(['id','date','query','user'],axis=1,inplace=True)

In [0]:
df_train.head()

| | sentiment | text |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


Creating new column that indicates character count of each tweet

In [0]:
df_train['text_len'] = [len(t) for t in df_train.text]
df_train['text_len']

0          115
1          111
2           89
3           47
4          111
          ... 
1599995     56
1599996     78
1599997     57
1599998     65
1599999     62
Name: text_len, Length: 1600000, dtype: int64

In [0]:
df_train['text_len'].max()

374

In [0]:
df_train['text_len'].min()

6

Creating data dictionary to facilitate data exploration and manipulation throughout the project.

In [0]:
data_dict = {
    'sentiment':{
        'type':df_train.sentiment.dtype,
        'description':'sentiment class - 0:negative, 1:positive'
    },
    'text':{
        'type':df_train.text.dtype,
        'description':'tweet text'
    },
    'text_len':{
        'type':df_train.text_len.dtype,
        'description':'Length of the tweet before processing'
    },
    'dataset_shape':df_train.shape
}
pprint(data_dict)

{'dataset_shape': (1600000, 3),
 'sentiment': {'description': 'sentiment class - 0:negative, 1:positive',
               'type': dtype('int64')},
 'text': {'description': 'tweet text', 'type': dtype('O')},
 'text_len': {'description': 'Length of the tweet before processing',
              'type': dtype('int64')}}


Checking how many tweets in the dataset contain more than the allowed 140 character count (note: 140 was the maximum character count for the tweets collected in this dataset)

In [0]:
df_train[df_train.text_len > 140].count()

sentiment    17174
text         17174
text_len     17174
dtype: int64
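One likely contributor to these over-length records is HTML entity encoding: an ampersand stored as `&amp;` takes five characters instead of one. The sketch below uses the standard library's `html.unescape` on a made-up string to show the effect on length. It is an alternative illustration rather than the BeautifulSoup approach used in this notebook, and whether entities account for all of the over-length tweets is not verified here.

```python
import html

# Made-up text containing HTML entities, for illustration only
raw = "Tom &amp; Jerry said &quot;hi&quot; &lt;3"

decoded = html.unescape(raw)
print(decoded)
print(len(raw), len(decoded))
```

Unlike BeautifulSoup's `get_text()`, `html.unescape` only decodes entities and does not strip HTML tags.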

## Data Pre-Processing

### HTML Decoding

Fixing HTML entity encoding issues (e.g., `&amp;`, `&quot;`) using BeautifulSoup (BS4)

In [0]:
df_train.text[279]

"Whinging. My client&amp;boss don't understand English well. Rewrote some text unreadable. It's written by v. good writer&amp;reviewed correctly. "

In [0]:
sample_01 = BeautifulSoup(df_train.text[279], 'lxml')
sample_01.get_text()

"Whinging. My client&boss don't understand English well. Rewrote some text unreadable. It's written by v. good writer&reviewed correctly. "

### Removing @ mentions

@ mentions indicate a user but for the purposes of this project, this text does not add value.
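Twitter handles may contain underscores, so it matters whether the pattern's character class includes `_`. The comparison below runs two candidate patterns on a made-up tweet to show the difference.

```python
import re

# Made-up tweet with underscores in both handles
tweet = "@_TheSpecialOne_ thanks for the follow, @scott_hamilton says hi"

narrow = re.sub(r'@[A-Za-z0-9]+', '', tweet)    # no underscore in the class
wide = re.sub(r'@[A-Za-z0-9_]+', '', tweet)     # underscore included

print(repr(narrow))
print(repr(wide))
```

With the narrow class, a handle that starts with an underscore is not matched at all, and a handle containing one is only partly removed.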

In [0]:
df_train.text[3000]

"@islandiva147 I sent u a tweet yesterday but I don't know why it didn't work  I guess you're sleeping right now I am working soon noon !!!"

In [0]:
sample_02 = re.sub(r'@[A-Za-z0-9_]+','',df_train.text[3000])
sample_02

" I sent u a tweet yesterday but I don't know why it didn't work  I guess you're sleeping right now I am working soon noon !!!"

### Removing URLs

URLs can be useful for other purposes, but similar to the @ mentions, they do not add value in training a model for sentiment analysis

In [0]:
df_train.text[100]

' Body Of Missing Northern Calif. Girl Found: Police have found the remains of a missing Northern California girl .. http://tr.im/imji'

In [0]:
sample_03 = re.sub('https?://[A-Za-z0-9./]+','',df_train.text[100])
sample_03

' Body Of Missing Northern Calif. Girl Found: Police have found the remains of a missing Northern California girl .. '
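The character class in the URL pattern decides where the match stops. As a small sketch on a made-up tweet, a class limited to letters, digits, `.` and `/` stops at the first `-` or `?`, while matching up to the next space removes the whole URL.

```python
import re

# Made-up tweet whose URL contains a hyphen and a query string
tweet = "new post http://example.com/my-page?id=42 check it out"

limited = re.sub(r'https?://[A-Za-z0-9./]+', '', tweet)
to_space = re.sub(r'https?://[^ ]+', '', tweet)

print(repr(limited))
print(repr(to_space))
```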

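### Resolving Mis-decoded UTF-8 Characters (ï¿½)

The sequence `ï¿½` is what the three UTF-8 bytes of the Unicode replacement character (U+FFFD) look like when the file is read as ISO-8859-1. It is not a byte order mark (BOM), so decoding with `utf-8-sig` leaves the text unchanged, and the sequence is replaced explicitly instead.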

In [0]:
df_train.text[226]

'Tuesdayï¿½ll start with reflection ï¿½n then a lecture in Stress reducing techniques. That sure might become very useful for us accompaniers '

In [0]:
sample_04 = df_train.text[226].encode().decode('utf-8-sig')
sample_04

'Tuesdayï¿½ll start with reflection ï¿½n then a lecture in Stress reducing techniques. That sure might become very useful for us accompaniers '

In [0]:
sample_04.replace(u'ï¿½', '?')

'Tuesday?ll start with reflection ?n then a lecture in Stress reducing techniques. That sure might become very useful for us accompaniers '
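To see where `ï¿½` comes from, the sketch below encodes the Unicode replacement character (U+FFFD) as UTF-8 and then decodes those bytes as ISO-8859-1, the encoding passed to `pd.read_csv` earlier. Reversing the round trip recovers the replacement character, though the original character it stood for is already lost.

```python
replacement_char = "\ufffd"

# U+FFFD is three bytes in UTF-8: EF BF BD
utf8_bytes = replacement_char.encode("utf-8")
print(utf8_bytes)

# Reading those bytes as ISO-8859-1 yields three separate characters
mojibake = utf8_bytes.decode("iso-8859-1")
print(mojibake)

# Reverse the mis-decoding on a fragment of the sample tweet
fragment = "Tuesday" + mojibake + "ll start"
repaired = fragment.encode("iso-8859-1").decode("utf-8")
print(repaired == "Tuesday\ufffdll start")

# A real UTF-8 byte order mark is a different byte sequence: EF BB BF
bom_bytes = "\ufeff".encode("utf-8")
print(bom_bytes)
```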

### Removing Tweet Hashtags

In [0]:
df_train.text[785]

"@hadtobeyou I'm at 900 words, it's all can do  I'll finish tomorrow maybe"

Experimenting with two methods for removing hashtags and other symbols

In [0]:
sample_05 = re.sub('[^a-zA-Z]', ' ', df_train.text[785])
sample_05

' hadtobeyou I m at     words  it s all can do  I ll finish tomorrow maybe'

In [0]:
sample_06 = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", df_train.text[785]).split())
sample_06

'I m at 900 words it s all can do I ll finish tomorrow maybe'

The second method seems to be more effective in cleaning the tweet.
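The sample tweet above contains no hashtag, so here is a sketch applying both methods to a made-up tweet that does. The first method also drops digits, while the second keeps them and additionally removes the @mention and the URL.

```python
import re

# Made-up tweet with a mention, a hashtag, digits, and a URL
tweet = "@user loving the #Sunshine at 5pm!! http://t.co/abc123"

# Method 1: replace everything that is not a letter with a space
method_1 = re.sub(r'[^a-zA-Z]', ' ', tweet)

# Method 2: drop @mentions, non-alphanumeric symbols, and URLs, then collapse whitespace
method_2 = ' '.join(
    re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split()
)

print(repr(method_1))
print(repr(method_2))
```

Method 1 leaves the mention text and URL fragments (`user`, `http`, `t`, `co`, `abc`) behind as words, which is why it is only safe to apply after mentions and URLs have been stripped.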

### Tweet Cleaner Function

In [0]:
from nltk.tokenize import WordPunctTokenizer
tok = WordPunctTokenizer()

In [0]:
re1 = r'@[A-Za-z0-9_]+'
re2 = r'https?://[^ ]+'
combined_re = r'|'.join((re1, re2))
www_pat = r'www.[^ ]+'
negations_dic = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                "mustn't":"must not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')

def tweet_cleaner(text):
    # Decode HTML entities such as &amp; and &quot;
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    # In Python 3, get_text() returns a str, which has no .decode() method,
    # so the mis-decoded replacement character is substituted directly.
    bom_removed = souped.replace('ï¿½', '?')
    # Strip @mentions and URLs
    stripped = re.sub(combined_re, '', bom_removed)
    stripped = re.sub(www_pat, '', stripped)
    lower_case = stripped.lower()
    # Expand negation contractions before punctuation is removed
    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], lower_case)
    letters_only = re.sub("[^a-zA-Z]", " ", neg_handled)
    # Replacing non-letters with spaces leaves unnecessary whitespace, so the text
    # is tokenized and re-joined; single-character tokens are dropped as well.
    words = [x for x in tok.tokenize(letters_only) if len(x) > 1]
    return (" ".join(words)).strip()
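The negation step is the least obvious part of the cleaner, so here it is in isolation on a made-up sentence. The sketch uses a three-entry subset of `negations_dic`. Expanding the contractions first keeps the word "not" in the text, whereas stripping non-letters directly would leave fragments such as `don` and `t`.

```python
import re

# Subset of the negation dictionary, for illustration
negations = {"don't": "do not", "can't": "can not", "isn't": "is not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations.keys()) + r')\b')

text = "i don't like it and i can't sleep"

# Expand each contraction to its two-word form
expanded = neg_pattern.sub(lambda m: negations[m.group()], text)
print(expanded)

# Without the expansion, removing non-letters splits the contraction apart
without_expansion = re.sub(r"[^a-zA-Z]", " ", text)
print(without_expansion)
```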

In [0]:
cleaner_test = df_train.text[:100]

In [0]:
test_result = []
for i in cleaner_test:
  test_result.append(tweet_cleaner(i))

In [0]:
test_result

['awww that bummer you shoulda got david carr of third day to do it',
 'is upset that he can not update his facebook by texting it and might cry as result school today also blah',
 'dived many times for the ball managed to save the rest go out of bounds',
 'my whole body feels itchy and like its on fire',
 'no it not behaving at all mad why am here because can not see you all over there',
 'not the whole crew',
 'need hug',
 'hey long time no see yes rains bit only bit lol fine thanks how you',
 'nope they did not have it',
 'que me muera',
 'spring break in plain city it snowing',
 'just re pierced my ears',
 'could not bear to watch it and thought the ua loss was embarrassing',
 'it it counts idk why did either you never talk to me anymore',
 'would ve been the first but did not have gun not really though zac snyder just doucheclown',
 'wish got to watch it with you miss you and how was the premiere',
 'hollis death scene will hurt me severely to watch on film wry is directors cut no

In [0]:
df_train_split = [0, 400000, 800000, 1200000, 1600000]

In [0]:
%%time
print("Cleaning tweets...\n")
clean_tweet_text = []
for i in range(df_train_split[0],df_train_split[1]):
    if( (i+1)%10000 == 0 ):                
        print("Tweets %d of %d has been processed" % ( i+1, df_train_split[1] ))                                                  
    clean_tweet_text.append(tweet_cleaner(df_train.text[i]))

Cleaning tweets...

Tweets 10000 of 400000 has been processed
Tweets 20000 of 400000 has been processed
...


In [0]:
len(clean_tweet_text)

400000

In [0]:
%%time
print("Cleaning tweets...\n")
for i in range(df_train_split[1],df_train_split[2]):
    if( (i+1)%10000 == 0 ):                
        print("Tweets %d of %d has been processed" % ( i+1, df_train_split[2] ))                                                  
    clean_tweet_text.append(tweet_cleaner(df_train.text[i]))

Cleaning tweets...

Tweets 410000 of 800000 has been processed
Tweets 420000 of 800000 has been processed
...

In [0]:
len(clean_tweet_text)

800000

In [0]:
%%time
print("Cleaning tweets...\n")
for i in range(df_train_split[2],df_train_split[3]):
    if( (i+1)%10000 == 0 ):                
        print("Tweets %d of %d has been processed" % ( i+1, df_train_split[3] ))                                                  
    clean_tweet_text.append(tweet_cleaner(df_train.text[i]))

Cleaning tweets...

Tweets 810000 of 1200000 has been processed
Tweets 820000 of 1200000 has been processed
...

In [0]:
len(clean_tweet_text)

1200000

In [0]:
%%time
print("Cleaning tweets...\n")
for i in range(df_train_split[3],df_train_split[4]):
    if( (i+1)%10000 == 0 ):                
        print("Tweets %d of %d has been processed" % ( i+1, df_train_split[4] ))                                                  
    clean_tweet_text.append(tweet_cleaner(df_train.text[i]))

Cleaning tweets...

Tweets 1210000 of 1600000 has been processed
Tweets 1220000 of 1600000 has been processed
...

In [0]:
len(clean_tweet_text)

1600000
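The four batch cells above repeat the same loop with different bounds. As a sketch of how they could be folded into one helper, the function below walks the boundary list pairwise and reports progress at a fixed interval. `clean_in_batches` and the stand-in `str.lower` cleaner are hypothetical; in the notebook the cleaner would be `tweet_cleaner` and the texts would be `df_train.text`.

```python
def clean_in_batches(texts, boundaries, cleaner, report_every=10000):
    """Apply cleaner to texts batch by batch, printing progress along the way."""
    cleaned = []
    # Pair consecutive boundaries, e.g. (0, 400000), (400000, 800000), ...
    for start, stop in zip(boundaries[:-1], boundaries[1:]):
        for i in range(start, stop):
            if (i + 1) % report_every == 0:
                print("Tweets %d of %d has been processed" % (i + 1, stop))
            cleaned.append(cleaner(texts[i]))
    return cleaned

# Tiny stand-in data so the sketch runs on its own
sample_texts = ["Hello World", "GOOD morning", "So TIRED", "Need COFFEE"]
result = clean_in_batches(sample_texts, [0, 2, 4], str.lower, report_every=2)
print(result)
```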

## Saving Processed Tweets as New CSV

In [0]:
df_train_clean = pd.DataFrame(clean_tweet_text,columns=['text'])
df_train_clean['target'] = df_train.sentiment
df_train_clean.head()

| | text | target |
|---|---|---|
| 0 | awww that bummer you shoulda got david carr of... | 0 |
| 1 | is upset that he can not update his facebook b... | 0 |
| 2 | dived many times for the ball managed to save ... | 0 |
| 3 | my whole body feels itchy and like its on fire | 0 |
| 4 | no it not behaving at all mad why am here beca... | 0 |


In [0]:
df_train_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 2 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   text    1600000 non-null  object
 1   target  1600000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 24.4+ MB


Checking for null values. There are no null values, but some rows contain empty strings ('').

In [0]:
df_train_clean[df_train_clean.isnull().any(axis=1)].head()

Unnamed: 0,text,target


In [0]:
np.sum(df_train_clean.isnull().any(axis=1))

0

In [0]:
df_train_clean[df_train_clean['text'] == ''].index

Int64Index([    208,     249,     282,     398,     430,    1011,    1014,
               1231,    1421,    1486,
            ...
            1596542, 1596670, 1597191, 1597326, 1597684, 1598192, 1598272,
            1599247, 1599494, 1599993],
           dtype='int64', length=3959)

In [0]:
print(df_train.text[1421])
print(df_train.text[1486])
print(df_train.text[1597684])
print(df_train.text[1599494])

@marlonjenglish 
@oishiieats 
@patty4sound http://twitpic.com/7iuns - 
@Sworn4DaBosses 


The empty strings come from tweets whose text contains only @mentions and/or URLs. As previously stated, such content does not contribute to training sentiment analysis models, so I drop these records from the dataset.

In [0]:
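# Assign the result back instead of calling replace(..., inplace=True) on a column selection
df_train_clean['text'] = df_train_clean['text'].replace('', np.nan)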

In [0]:
np.sum(df_train_clean.isnull().any(axis=1))

3959

In [0]:
print(df_train_clean.text[1421])
print(df_train_clean.text[1486])
print(df_train_clean.text[1597684])
print(df_train_clean.text[1599494])

nan
nan
nan
nan


In [0]:
df_train_clean.dropna(inplace=True)
df_train_clean.reset_index(drop=True,inplace=True)
df_train_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1596041 entries, 0 to 1596040
Data columns (total 2 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   text    1596041 non-null  object
 1   target  1596041 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 24.4+ MB
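The drop step can also be sketched in plain Python. The snippet below strips mentions and URLs from a few made-up tweets and then discards the ones left empty, keeping texts and labels aligned. The pattern mirrors `re1` and `re2` from the cleaner; the data is illustrative only.

```python
import re

mention_or_url = re.compile(r'@[A-Za-z0-9_]+|https?://[^ ]+')

# Made-up tweets and labels, for illustration only
tweets = ["@someone ", "@friend http://example.com/pic - ", "so happy today", "@pal miss you"]
labels = [0, 1, 1, 0]

pairs = []
for text, label in zip(tweets, labels):
    stripped = mention_or_url.sub('', text)
    # Keep letters only and collapse whitespace
    cleaned = ' '.join(re.sub(r'[^a-zA-Z]', ' ', stripped).split())
    if cleaned:  # drop records whose text became empty
        pairs.append((cleaned, label))

print(pairs)
```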


In [0]:
df_train_clean.to_csv('clean_tweets.csv',encoding='utf-8')
csv = 'clean_tweets.csv'
my_df = pd.read_csv(csv,index_col=0)
my_df.head()


| | text | target |
|---|---|---|
| 0 | awww that bummer you shoulda got david carr of... | 0 |
| 1 | is upset that he can not update his facebook b... | 0 |
| 2 | dived many times for the ball managed to save ... | 0 |
| 3 | my whole body feels itchy and like its on fire | 0 |
| 4 | no it not behaving at all mad why am here beca... | 0 |
