# Twitter Health Data
### Atheer Al Attar, March 2018
This is a collection of tweets from several health news agencies, gathered over time and split into one file per agency. We will import the data, visualize it, and perform exploratory data analysis. The data comes from the University of California, Irvine (UCI) Machine Learning Repository.

Table of Contents:

1. Data Import and unzipping.
2. Reading the directory, concatenating the different files into one
3. Reading the text files
4. Preparing the corpus, data cleaning
5. Word Frequencies Generation


### 1. Data Import and unzipping.


In [4]:
import pandas as pd

In [5]:
import requests, zipfile, io
r = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/00438/Health-News-Tweets.zip")
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()


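In [6]:
# Alternative to the cell above: save the archive to disk, then extract it.
import zipfile, urllib.request, shutil

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00438/Health-News-Tweets.zip'
file_name = 'myzip.zip'

# Download the archive to file_name before opening it.
with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

# Extract everything into the current directory.
with zipfile.ZipFile(file_name, 'r') as zip_ref:
    zip_ref.extractall("./")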

### 2. Reading the directory, concatenating the different files into one
We will use the glob module to collect the file names into a list and then iterate over them.

In [7]:
import glob
files_list = glob.glob("./Health-Tweets/*.txt")
files_list[0]


'./Health-Tweets/cbchealth.txt'
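
Each path encodes the agency in its file name. A minimal sketch of pulling that name out with pathlib, assuming the one-file-per-agency layout shown above; the helper agency_from_path is hypothetical and not part of the original notebook:

```python
from pathlib import Path

def agency_from_path(path):
    """Return the file name without its directory or extension."""
    return Path(path).stem

print(agency_from_path('./Health-Tweets/cbchealth.txt'))  # cbchealth
```

Using the stem avoids depending on the exact length of the directory prefix.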

### 3. Reading the text files
I noticed that some of the files contain badly formatted rows. The code below wraps each read in a try/except, so a file that fails to parse is skipped instead of stopping the loop. The housekeeping that follows (renaming the columns, converting the dates to a proper datetime type, and replacing the per-file indices with one consistent index) is done in the cell under section 4.

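In [32]:
frames = []
for file in files_list:
    try:
        # Pipe-delimited, no header row. comment="#" drops the rest of a line
        # after a '#', which also truncates tweets at their first hashtag.
        x = pd.read_csv(file, sep="|", comment="#", na_values=['Nothing'], header=None)
        x['file_name'] = file  # remember which agency file each row came from
        frames.append(x)
    except Exception:
        # Skip files that fail to parse.
        pass
# Concatenate once at the end (DataFrame.append no longer exists in pandas 2.0+).
df = pd.concat(frames)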
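
One possible cause of badly formatted rows in pipe-delimited text is a '|' character inside the tweet itself, which adds an extra field. That is an assumption about these files, not something verified here. A minimal sketch of a line-level alternative using only the standard library; parse_tweet_line and the sample lines are hypothetical:

```python
def parse_tweet_line(line):
    """Split a line into (id, date, tweet), or return None if it has fewer than three fields.

    maxsplit=2 stops after the second '|', so any further pipes stay in the tweet text.
    """
    parts = line.rstrip("\n").split("|", 2)
    if len(parts) < 3:
        return None
    return tuple(parts)

# Made-up sample lines, for illustration only.
good = "1|Mon Jan 05 10:00:00 +0000 2015|Flu season | what to know"
bad = "not a tweet row"
print(parse_tweet_line(good))
print(parse_tweet_line(bad))
```

Parsing line by line would let a single bad row be dropped without discarding the rest of its file.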

### 4. Preparing the corpus, data cleaning
In this stage the tweet text is cleaned of punctuation, stop words, and links, a word-count column is added, and the columns are given names.

In [33]:
df.columns = ['id', 'date', 'tweet', 'source']
df = df.drop('id', axis=1)
df['date'] = pd.to_datetime(df['date'])
# Replace the per-file row numbers with one consistent index.
df = df.reset_index(drop=True)
# Strip the leading './Health-Tweets/' (16 characters) to leave the file name.
df['source'] = df['source'].str[16:]

# Generating word counts
df['word_count'] = df['tweet'].apply(lambda x: len(str(x).split(" ")))

# Removing stopwords (run nltk.download('stopwords') once if the corpus is missing)
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
stop.update(['would', 'like'])
df['tweet'] = df['tweet'].apply(str).str.lower()
df['tweet'] = df['tweet'].apply(lambda x: " ".join(w for w in x.split() if w not in stop))

# Removing the http links (before punctuation, while the URLs are still intact)
df['tweet'] = df['tweet'].str.replace(r'\s*http\S+', '', regex=True)

# Removing punctuation; regex=True is required because recent pandas treats the pattern literally by default
df['tweet'] = df['tweet'].str.replace(r'[^\w\s]', '', regex=True)

# Check https://regex101.com for more info on regex
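
The cleaning steps can also be tried on a single string, which makes them easier to check. A minimal sketch using only the standard library; clean_tweet and the tiny stop list are hypothetical stand-ins, and unlike the cell above it removes stop words last:

```python
import re

def clean_tweet(text, stop_words):
    """Lowercase, drop links, drop punctuation, then drop stop words."""
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)   # links first, while they are still intact
    text = re.sub(r"[^\w\s]", "", text)    # anything that is not a word character or whitespace
    return " ".join(w for w in text.split() if w not in stop_words)

# A tiny stand-in stop list; the notebook uses NLTK's English list instead.
stop_words = {"the", "is", "this"}
print(clean_tweet("Check this out: http://t.co/abc The flu is BACK!", stop_words))
# check out flu back
```

Removing stop words after punctuation means a stop word followed by a comma or period is still matched.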


### 5. Word Frequencies Generation

In [10]:
def row_word_count(string, dic):
    """Add the count of each space-separated word in string to dic and return dic."""
    for word in string.split(" "):
        dic[word] = dic.get(word, 0) + 1
    return dic

In [11]:
import operator
dic={}
for tweets in df.tweet:
    row_word_count(tweets,dic)

sorted_d = sorted(dic.items(), key=operator.itemgetter(1), reverse=True)
word_counts=pd.DataFrame(sorted_d,columns=['Word','Count'])

In [12]:
# Keep rows 1-19 of the ranking; iloc[1:20] skips the top-ranked row 0.
word_counts = word_counts.iloc[1:20, :]
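
The same tally can be written with collections.Counter from the standard library. A minimal sketch, not the notebook's own method; top_words and the sample tweets are hypothetical:

```python
from collections import Counter

def top_words(tweets, n):
    """Count whitespace-separated tokens across all tweets and return the n most common."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.split())  # split() with no argument never yields empty tokens
    return counts.most_common(n)

# Made-up sample tweets, for illustration only.
sample = ["flu shot", "flu  season", "ebola outbreak"]
print(top_words(sample, 1))
```

The returned list of (word, count) pairs can be passed to pd.DataFrame with columns=['Word','Count'], as in the cell above.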

### Need to do next
1- Plot highest word by agency
2- Plot a timeline for
3- Number of tweets per agency per time
4- word cloud per agency