# _Dec. 27, 2019_

In [63]:
import pandas as pd
pd.set_option("display.max_columns", None)

dtype = {
    "id_str": str
}

# load in data
df = pd.read_json("json-data/no_retweets.json", dtype=dtype, orient="split")

In [64]:
# create a subset of the data to experiment with
subset = df.sample(n=1000, random_state=1)

In [65]:
# reset index so it is easier to work with
subset.reset_index(drop=True, inplace=True)

## _Cleaning Text: Subset_

In [66]:
import re

def text_clean(text):
    """
    Clean Tweet text: remove links and punctuation, then normalize whitespace.
    """
    # trim leading and trailing newline characters (interior ones are handled below)
    text = text.strip("\n")
    # delete links and any character that is neither a word character nor whitespace,
    # then trim surrounding spaces
    text = re.sub(r"http\S+|[^\w\s]", "", text).strip(" ")
    # replace any remaining non-word character (e.g. an interior newline) with a space
    text = re.sub(r"\W", " ", text)
    # collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    return text

In [67]:
text_clean(subset["full_text"][24])

'Gavin Lux just hit career homer 1 Earlier today the Dodgers rookie chatted with AlexaDatt and ScottBraun on TheRundown'
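
To make the individual stages of `text_clean` easier to follow, here is a minimal sketch that applies the same four operations and records the intermediate string after each one. The helper name `cleaning_steps` and the example tweet are made up for illustration and do not come from the dataset.

```python
import re

def cleaning_steps(text):
    """Return the intermediate string after each stage of the cleaning logic."""
    steps = []
    # 1. trim newline characters from the start and end only
    text = text.strip("\n")
    steps.append(text)
    # 2. remove links and punctuation, then trim surrounding spaces
    text = re.sub(r"http\S+|[^\w\s]", "", text).strip(" ")
    steps.append(text)
    # 3. turn each remaining non-word character (e.g. an interior newline) into a space
    text = re.sub(r"\W", " ", text)
    steps.append(text)
    # 4. collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    steps.append(text)
    return steps

# hypothetical tweet, not taken from the dataset
example = "Big win tonight!\n\nThanks, @fans: https://t.co/xyz #GoTeam\n"
for stage in cleaning_steps(example):
    print(repr(stage))
```

Printing with `repr` makes the interior newlines and doubled spaces visible, which shows why the last two substitutions are still needed after the punctuation has been removed.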

In [68]:
for text in subset["full_text"][:10].apply(text_clean):
    print(text, "\n")

The National Immigrant Justice Center or NIJC helps asylumseekers win their cases and fights against renewed attempts to separate and jail more families 

FizanKhan1 Hi Fizan This guide may help If something looks outofline we suggest adding extra layers of security to your Google Account with these tips 

A 50yd field goal in the University of Phoenix Stadium deflects about onethird inch to the right due to Earths rotation 

Emma Stone Just Got Engaged To Dave McCary And She Looks So Happy 

New York or Nowhere vibes 

Where three womens journey from Cuba to America left them 

Wow guess Im not a Yankees fan anymore LETS GO METS Man you cant trust 

SUBSCRIBE to my YOUTUBE channel SMSAUDIO EFFENVODKA 

I dont get mad I get motivated 

Incognito at NYCC Grunkle Stan had some fun today A big thanks to the fans javitscenter and ComicCon gravityfalls 



In [69]:
%%time

# see how long it takes to apply the text_clean function to the entire subset
subset["clean_text"] = subset["full_text"].apply(text_clean)

CPU times: user 25.2 ms, sys: 1.35 ms, total: 26.6 ms
Wall time: 26.1 ms


In [70]:
# compare the original and cleaned text for the first two rows
subset.loc[[0,1], ["full_text", "clean_text"]]

| | full_text | clean_text |
|---|---|---|
| 0 | The National Immigrant Justice Center, or @NIJ... | The National Immigrant Justice Center or NIJC ... |
| 1 | @FizanKhan1 Hi Fizan. This guide may help: htt... | FizanKhan1 Hi Fizan This guide may help If som... |


In [71]:
for i, row in subset[:4].iterrows():
    print(row["full_text"], "\n")
    print(row["clean_text"], "\n", "-"*30)

The National Immigrant Justice Center, or @NIJC, helps asylum-seekers win their cases and fights against renewed attempts to separate and jail more families. https://t.co/myOcT2Q6fu 

The National Immigrant Justice Center or NIJC helps asylumseekers win their cases and fights against renewed attempts to separate and jail more families 
 ------------------------------
@FizanKhan1 Hi Fizan. This guide may help: https://t.co/VRPMm6xAcH. If something looks out-of-line, we suggest adding extra layers of security to your Google Account with these tips: https://t.co/k6X4NghPTb. 

FizanKhan1 Hi Fizan This guide may help If something looks outofline we suggest adding extra layers of security to your Google Account with these tips 
 ------------------------------
A 50-yd field goal, in the University of Phoenix Stadium, deflects about one-third inch to the right due to Earth's rotation 

A 50yd field goal in the University of Phoenix Stadium deflects about onethird inch to the right due to Earth

In [72]:
# test with a sample of 5 observations
sample = subset.sample(n=5, random_state=1)

In [73]:
for i, row in sample[:4].iterrows():
    print(row["full_text"], "\n")
    print(row["clean_text"], "\n", "-"*30)

Take no revenge and cherish no grudge against your own people. You shall love your neighbor as yourself.

Leviticus 19:18 

Take no revenge and cherish no grudge against your own people You shall love your neighbor as yourself Leviticus 1918 
 ------------------------------
#FBI Director Wray: At the end of the day, we all want the same thing: to protect our innovation, our systems, and above all, our people. #ICCS2018 

FBI Director Wray At the end of the day we all want the same thing to protect our innovation our systems and above all our people ICCS2018 
 ------------------------------
The House’s vote to impeach President Trump drew an immediate reaction from the federal judiciary as a federal appeals court demanded answers on what impact the historic move may have on ongoing legal efforts to obtain records and testimony https://t.co/SpQr6KL58J 

The Houses vote to impeach President Trump drew an immediate reaction from the federal judiciary as a federal appeals court demanded ans
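
The outputs above show a side effect of deleting punctuation outright: tokens on either side of it are fused ("asylum-seekers" becomes "asylumseekers", "19:18" becomes "1918"). As a possible alternative, and not what this notebook uses, the sketch below replaces punctuation with a space instead. The function name `text_clean_spaced` is hypothetical.

```python
import re

def text_clean_spaced(text):
    """Hypothetical variant: punctuation becomes a space instead of being deleted."""
    # remove links first so their punctuation is not split into stray tokens
    text = re.sub(r"http\S+", "", text)
    # replace every non-word character (punctuation, newlines) with a space
    text = re.sub(r"\W", " ", text)
    # collapse repeated whitespace and trim both ends
    return re.sub(r"\s+", " ", text).strip()

print(text_clean_spaced("Leviticus 19:18"))
print(text_clean_spaced("UNI loses to Xavier 77/67"))
```

The trade-off is that apostrophes are split as well, so "Earth's" would become "Earth s". Which behaviour is preferable depends on the tokenizer used downstream.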

## _Cleaning Text: `json-data/no_retweets.json`_

In [74]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 211104 entries, 0 to 211103
Data columns (total 8 columns):
id_str            211104 non-null object
screen_name       211104 non-null object
created_at        211104 non-null datetime64[ns]
lang              211104 non-null object
source            211104 non-null object
retweet_count     211104 non-null int64
favorite_count    211104 non-null int64
full_text         211104 non-null object
dtypes: datetime64[ns](1), int64(2), object(5)
memory usage: 14.5+ MB


In [59]:
%%timeit

df["full_text"].apply(text_clean)

4.72 s ± 352 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
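
Outside a notebook, the `%%timeit` magic is not available. A rough equivalent using the standard-library `timeit` module might look like the sketch below. It includes a copy of the cleaning logic so it runs on its own, and it times a synthetic list of identical tweets, so the numbers it prints are not comparable to the measurement above.

```python
import re
import timeit

def text_clean(text):
    # copy of the notebook's cleaning logic so the sketch is self-contained
    text = text.strip("\n")
    text = re.sub(r"http\S+|[^\w\s]", "", text).strip(" ")
    text = re.sub(r"\W", " ", text)
    return re.sub(r"\s+", " ", text)

# synthetic stand-in for df["full_text"]; the real column has 211104 rows
tweets = ["Hello, world! https://t.co/abc #demo"] * 10_000

# run the full pass five times, one loop each, and keep every timing
timings = timeit.repeat(lambda: [text_clean(t) for t in tweets], number=1, repeat=5)
print(f"best of 5: {min(timings):.3f} s for {len(tweets)} tweets")
```

This sketch reports the fastest run, whereas `%%timeit` reports a mean and standard deviation. Either summary can be computed from the `timings` list.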


In [75]:
%%time 

# create clean text column
df["clean_text"] = df["full_text"].apply(text_clean)

CPU times: user 4.21 s, sys: 10.5 ms, total: 4.22 s
Wall time: 4.31 s


In [76]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 211104 entries, 0 to 211103
Data columns (total 9 columns):
id_str            211104 non-null object
screen_name       211104 non-null object
created_at        211104 non-null datetime64[ns]
lang              211104 non-null object
source            211104 non-null object
retweet_count     211104 non-null int64
favorite_count    211104 non-null int64
full_text         211104 non-null object
clean_text        211104 non-null object
dtypes: datetime64[ns](1), int64(2), object(6)
memory usage: 16.1+ MB


## _Add Label Column_

In [87]:
# insert label column for ML/DL
df["label"] = "real"

In [88]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 211104 entries, 0 to 211103
Data columns (total 10 columns):
id_str            211104 non-null object
screen_name       211104 non-null object
created_at        211104 non-null datetime64[ns]
lang              211104 non-null object
source            211104 non-null object
retweet_count     211104 non-null int64
favorite_count    211104 non-null int64
full_text         211104 non-null object
clean_text        211104 non-null object
label             211104 non-null object
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 17.7+ MB


In [89]:
df.head(3)

| | id_str | screen_name | created_at | lang | source | retweet_count | favorite_count | full_text | clean_text | label |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1209319045411028992 | NBCNews | 2019-12-24 03:45:03 | en | SocialFlow | 12 | 28 | A North Carolina man is accused of using eye d... | A North Carolina man is accused of using eye d... | real |
| 1 | 1209315792657043456 | NBCNews | 2019-12-24 03:32:07 | en | SocialFlow | 4 | 16 | 11 best gifts and gadgets for home cooks. http... | 11 best gifts and gadgets for home cooks NBCNe... | real |
| 2 | 1209308993459572736 | NBCNews | 2019-12-24 03:05:06 | en | SocialFlow | 14 | 61 | A woman upset that KFC got sandwich wrong call... | A woman upset that KFC got sandwich wrong call... | real |


## _Shuffle/Split Data into Train & Test sets_

In [90]:
from sklearn.model_selection import train_test_split

In [92]:
train, test = train_test_split(df, test_size=0.2, random_state=1, shuffle=True)

In [102]:
len(train), len(test)

(168883, 42221)
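
The two lengths can be checked by hand. The sketch below assumes the test share is rounded up to a whole row and the training set receives the remainder, which is consistent with the counts printed above.

```python
import math

n_samples = 211104  # rows in df, taken from df.info()
test_size = 0.2

# 0.2 * 211104 = 42220.8, which is rounded up to a whole row (assumed rounding rule)
n_test = math.ceil(test_size * n_samples)
# the training set gets whatever is left
n_train = n_samples - n_test

print(n_train, n_test)
```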

In [104]:
train = train.reset_index(drop=True)
train[:5]

| | id_str | screen_name | created_at | lang | source | retweet_count | favorite_count | full_text | clean_text | label |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 475980318382383104 | tonyhawk | 2014-06-09 12:38:45 | en | Twitter for iPhone | 65 | 142 | June 11, 9pm: the 2014 @gumball3000 concludes ... | June 11 9pm the 2014 gumball3000 concludes wit... | real |
| 1 | 1039495495494918144 | SenSchumer | 2018-09-11 12:46:53 | en | Twitter for iPhone | 323 | 735 | As @USATODAY reported, “By the end of 2018, ma... | As USATODAY reported By the end of 2018 many e... | real |
| 2 | 912056752681766912 | neiltyson | 2017-09-24 20:50:49 | en | TweetDeck | 1000 | 10291 | I thought the frozen dead dudes couldn’t swim,... | I thought the frozen dead dudes couldnt swim b... | real |
| 3 | 944420654497042432 | ChuckGrassley | 2017-12-23 04:13:25 | es | Twitter for iPhone | 6 | 55 | UNI loses to Xavier 77/67 | UNI loses to Xavier 7767 | real |
| 4 | 659150226431811584 | BarackObama | 2015-10-27 23:30:35 | en | Twitter Web Client | 1175 | 1109 | Retweet to spread the word. #GetCovered https:... | Retweet to spread the word GetCovered | real |


In [106]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168883 entries, 0 to 168882
Data columns (total 10 columns):
id_str            168883 non-null object
screen_name       168883 non-null object
created_at        168883 non-null datetime64[ns]
lang              168883 non-null object
source            168883 non-null object
retweet_count     168883 non-null int64
favorite_count    168883 non-null int64
full_text         168883 non-null object
clean_text        168883 non-null object
label             168883 non-null object
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 12.9+ MB


In [105]:
test = test.reset_index(drop=True)
test[:5]

| | id_str | screen_name | created_at | lang | source | retweet_count | favorite_count | full_text | clean_text | label |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1067419455158931456 | HillaryClinton | 2018-11-27 14:06:44 | en | Twitter Web Client | 381 | 973 | The National Immigrant Justice Center, or @NIJ... | The National Immigrant Justice Center or NIJC ... | real |
| 1 | 1190540505404325888 | Google | 2019-11-02 08:05:50 | en | Conversocial | 0 | 0 | @FizanKhan1 Hi Fizan. This guide may help: htt... | FizanKhan1 Hi Fizan This guide may help If som... | real |
| 2 | 562080196489404416 | neiltyson | 2015-02-02 02:48:56 | en | TweetDeck | 3621 | 5050 | A 50-yd field goal, in the University of Phoen... | A 50yd field goal in the University of Phoenix... | real |
| 3 | 1202422229256097792 | BuzzFeed | 2019-12-05 02:59:34 | en | PubHub by BuzzFeed | 51 | 622 | Emma Stone Just Got Engaged To Dave McCary, An... | Emma Stone Just Got Engaged To Dave McCary And... | real |
| 4 | 1128314104853278720 | nyknicks | 2019-05-14 15:00:20 | en | Spredfast | 624 | 2279 | New York or Nowhere vibes. https://t.co/WW41xo... | New York or Nowhere vibes | real |


In [107]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42221 entries, 0 to 42220
Data columns (total 10 columns):
id_str            42221 non-null object
screen_name       42221 non-null object
created_at        42221 non-null datetime64[ns]
lang              42221 non-null object
source            42221 non-null object
retweet_count     42221 non-null int64
favorite_count    42221 non-null int64
full_text         42221 non-null object
clean_text        42221 non-null object
label             42221 non-null object
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 3.2+ MB


## _Create JSONs of train and test data_

In [108]:
train.to_json("verified_train.json", orient="split")

In [109]:
test.to_json("verified_test.json", orient="split")
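
For reference, a file written with `orient="split"` stores the column names, the index, and the row values under three separate top-level keys. The sketch below builds a hand-written miniature of that layout with the standard-library `json` module. The two IDs are borrowed from the `df.head(3)` output, while the text values are made up.

```python
import json

# hand-written miniature of a "split"-oriented file; text values are made up
payload = {
    "columns": ["id_str", "clean_text", "label"],
    "index": [0, 1],
    "data": [
        ["1209319045411028992", "first example tweet", "real"],
        ["1209315792657043456", "second example tweet", "real"],
    ],
}

text = json.dumps(payload)
loaded = json.loads(text)

# rebuild one dictionary per row from the column names and row values
rows = [dict(zip(loaded["columns"], values)) for values in loaded["data"]]
print(rows[0]["id_str"])
```

When reading the real files back with pandas, passing the same `dtype` mapping used at the top of this notebook should keep `id_str` as a string column.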