# Twitter Exploratory Data Analysis

___

# Table of Contents

- [Setup](#Setup)
  - [Import Libraries](#Import-Libraries)
  - [Increase Max Rows, Columns, Width](#Increase-Max-Rows,-Columns,-Width)
- [Load Data](#Load-Data)
- [Data Information + Description](#Data-Information-+-Description)
  - [Check df.head() to make sure the data came through](#Check-df.head()-to-make-sure-the-data-came-through)
  - [Check df.info() to understand our datatypes](#Check-df.info()-to-understand-our-datatypes)
  - [Check df.isna() to check for NaN values](#Check-df.isna()-to-check-for-NaN-values)
  - [Use df.fillna() to eliminate any NaN values](#Use-df.fillna()-to-eliminate-any-NaN-values)
  - [Check df.isna() again to make sure all NaN values are gone](#Check-df.isna()-again-to-make-sure-all-NaN-values-are-gone)
  - [Use df.describe() to get an overall sense of the essential metrics for our dataset](#Use-df.describe()-to-get-an-overall-sense-of-the-essential-metrics-for-our-dataset)
- [Check For Multicollinearity](#Check-For-Multicollinearity)
- [Check For Heteroskedasticity](#Check-For-Heteroskedasticity)
- [Explore](#Explore)

___

## Setup

### Import Libraries

In [2]:
import sys
sys.path.append("..")  # make modules in the parent directory importable

# Standard library
import json
import math
import pickle
import random
from collections import Counter

# Data handling and plotting
import numpy as np
import pandas as pd
import requests
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
from wordcloud import WordCloud

# Statistics
import scipy.stats as stats
from scipy.stats import norm
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Machine learning
import sklearn
from sklearn.linear_model import LassoCV, Lasso, Ridge, LinearRegression, LogisticRegression, SGDClassifier
from sklearn.metrics import roc_curve, auc, confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

### Increase Max Rows, Columns, Width

In [3]:
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)
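
The `pd.set_option` calls above change the display limits for the whole session. If we only want the wider view for a single output, `pd.option_context` applies the same options temporarily. A minimal sketch, where the `wide` frame is a hypothetical stand-in and not part of the Twitter data:

```python
import pandas as pd

# Hypothetical wide frame, used only to illustrate the display options.
wide = pd.DataFrame([list(range(50))], columns=[f"col_{i}" for i in range(50)])

before = pd.get_option("display.max_columns")

# Raise the limits for this block only; the previous values return on exit.
with pd.option_context("display.max_columns", 1000, "display.width", 1000):
    rendered = repr(wide)

after = pd.get_option("display.max_columns")
```

Inside the `with` block every column is rendered; once it exits, `display.max_columns` is back to whatever it was before.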

___

## Load Data

In [9]:
df = pd.read_csv('Twitter_Data.csv', index_col=0)

___

## Data Information + Description

#### Check df.head() to make sure the data came through

In [11]:
df.head(5)

| | deets_author_id | deets_created_at | deets_tweet_id | deets_language | deets_possibly_sensitive | deets_text | deets_retweet_count | deets_reply_count | deets_like_count | deets_quote_count | deets_text_length | user_created_at | user_description | user_id | user_real_name | user_profile_image_url | user_protected | user_url | username | verified | format | user_followers_count | user_following_count | user_tweet_count | user_listed_count | user_description_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25073877 | 2020-01-31T00:19:23.000Z | 1223038027234267137 | en | False | Great poll in Iowa, where I just landed for a ... | 2406 | 1346 | 8517 | 192 | 91 | 2009-03-18T13:46:38.000Z | 45th President of the United States of America🇺🇸 | 25073877 | Donald J. Trump | https://pbs.twimg.com/profile_images/874276197... | False | https://t.co/OMxB0x7xC5 | realDonaldTrump | True | detailed | 71873110 | 47 | 48500 | 113961 | 48 |
| 1 | 783214 | 2020-01-30T20:00:13.000Z | 1222972807639896064 | en | False | Yes, it's still January | 28532 | 3049 | 77352 | 4354 | 23 | 2007-02-20T14:35:54.000Z | What’s happening?! | 783214 | Twitter | https://pbs.twimg.com/profile_images/111172963... | False | https://t.co/TAXQpsHa5X | Twitter | True | detailed | 57135266 | 1 | 12909 | 90552 | 18 |
| 2 | 409486555 | 2020-01-20T13:56:40.000Z | 1219257438949597185 | en | False | To honor Dr. King's legacy, we all can play a ... | 10107 | 508 | 44343 | 253 | 303 | 2011-11-10T20:13:01.000Z | Girl from the South Side and former First Lady... | 409486555 | Michelle Obama | https://pbs.twimg.com/profile_images/119281123... | False | https://t.co/0UVvR5L6vm | MichelleObama | True | detailed | 14331923 | 18 | 1225 | 24897 | 96 |
| 3 | 14130366 | 2020-01-26T05:08:25.000Z | 1221298825786089472 | en | False | This Black History Month, we’ll be celebrating... | 153 | 56 | 1068 | 8 | 284 | 2008-03-12T05:51:53.000Z | CEO, Google and Alphabet | 14130366 | Sundar Pichai | https://pbs.twimg.com/profile_images/864282616... | False | NaN | sundarpichai | True | detailed | 2636216 | 337 | 1316 | 6747 | 25 |
| 4 | 1636590253 | 2020-01-27T15:15:10.000Z | 1221813908576464896 | en | False | Today, we remember the millions of lives lost.... | 678 | 119 | 5556 | 32 | 273 | 2013-07-31T22:41:25.000Z | Apple CEO  Auburn 🏀 🏈 Duke 🏀 National Parks 🏞... | 1636590253 | Tim Cook | https://pbs.twimg.com/profile_images/119411373... | False | NaN | tim_cook | True | detailed | 11677552 | 68 | 976 | 21430 | 135 |


#### Check df.info() to understand our datatypes

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 26 columns):
deets_author_id             5 non-null int64
deets_created_at            5 non-null object
deets_tweet_id              5 non-null int64
deets_language              5 non-null object
deets_possibly_sensitive    5 non-null bool
deets_text                  5 non-null object
deets_retweet_count         5 non-null int64
deets_reply_count           5 non-null int64
deets_like_count            5 non-null int64
deets_quote_count           5 non-null int64
deets_text_length           5 non-null int64
user_created_at             5 non-null object
user_description            5 non-null object
user_id                     5 non-null int64
user_real_name              5 non-null object
user_profile_image_url      5 non-null object
user_protected              5 non-null bool
user_url                    3 non-null object
username                    5 non-null object
verified                    5 non-null bool
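
In the `df.info()` output, `deets_created_at` and `user_created_at` are `object` columns, meaning pandas kept the timestamps as strings. Before doing any time-based exploration we would likely want real datetimes. A minimal sketch, assuming the ISO 8601 format seen in `df.head()`; the `sample` frame and the `account_age_days` column are hypothetical names introduced only for this illustration:

```python
import io

import pandas as pd

# Two rows copied from the df.head() output, reduced to three columns.
csv_text = """deets_created_at,user_created_at,deets_like_count
2020-01-31T00:19:23.000Z,2009-03-18T13:46:38.000Z,8517
2020-01-30T20:00:13.000Z,2007-02-20T14:35:54.000Z,77352
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Convert the string timestamps to timezone-aware datetimes (UTC).
for col in ["deets_created_at", "user_created_at"]:
    sample[col] = pd.to_datetime(sample[col], utc=True)

# Hypothetical derived feature: whole days between account creation and the tweet.
sample["account_age_days"] = (
    sample["deets_created_at"] - sample["user_created_at"]
).dt.days
```

The same loop could be applied to `df` itself once the full dataset is loaded.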

#### Check df.isna() to check for NaN values

In [15]:
df.isna().sum()

deets_author_id             0
deets_created_at            0
deets_tweet_id              0
deets_language              0
deets_possibly_sensitive    0
deets_text                  0
deets_retweet_count         0
deets_reply_count           0
deets_like_count            0
deets_quote_count           0
deets_text_length           0
user_created_at             0
user_description            0
user_id                     0
user_real_name              0
user_profile_image_url      0
user_protected              0
user_url                    2
username                    0
verified                    0
format                      0
user_followers_count        0
user_following_count        0
user_tweet_count            0
user_listed_count           0
user_description_length     0
dtype: int64
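
With 26 columns, most of the output above is zeros. A small sketch of a more compact report that keeps only the columns with gaps and adds the share of rows affected; the `toy` frame below is a stand-in that mirrors the pattern above (two missing `user_url` values out of five rows), and `report` is a hypothetical name:

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the pattern above: only `user_url` has gaps (2 of 5 rows).
toy = pd.DataFrame({
    "username": ["realDonaldTrump", "Twitter", "MichelleObama", "sundarpichai", "tim_cook"],
    "user_url": [
        "https://t.co/OMxB0x7xC5",
        "https://t.co/TAXQpsHa5X",
        "https://t.co/0UVvR5L6vm",
        np.nan,
        np.nan,
    ],
})

# Count and share of missing values per column, keeping only columns with gaps.
report = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share_missing": toy.isna().mean(),
})
report = report[report["n_missing"] > 0]
```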

#### Use df.fillna() to eliminate any NaN values

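- For numeric columns, impute the **mean** if the data is approximately **normally distributed.**
- Impute the **median** if the data is **non-normally distributed** (for example, skewed), because the median is less sensitive to outliers.
- Impute `0` if the NaN value does not matter to our analysis. That is the case here: the only NaN values are the two in `user_url`.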

In [19]:
df = df.fillna(0)
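
The mean/median rule listed above can also be automated for numeric columns. The sketch below is one possible approach, not the method used in this notebook: it assumes a Shapiro-Wilk test at a 0.05 threshold as the normality check, and `impute_numeric` and the `demo` data are hypothetical names made up for the illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats


def impute_numeric(frame, alpha=0.05):
    """Fill NaNs in numeric columns: mean if Shapiro-Wilk does not reject
    normality at level `alpha`, otherwise median."""
    filled = frame.copy()
    strategy = {}
    for col in filled.select_dtypes(include="number").columns:
        observed = filled[col].dropna()
        if observed.size == len(filled) or observed.size < 3:
            continue  # nothing to fill, or too few points for the test
        _, p_value = stats.shapiro(observed)
        use_mean = p_value > alpha
        strategy[col] = "mean" if use_mean else "median"
        fill_value = observed.mean() if use_mean else observed.median()
        filled[col] = filled[col].fillna(fill_value)
    return filled, strategy


# Hypothetical data: one bell-shaped column and one heavily right-skewed column,
# each with a single missing value.
demo = pd.DataFrame({
    "bell": np.append(stats.norm.ppf(np.linspace(0.05, 0.95, 19)), np.nan),
    "skewed": np.append(np.exp(np.linspace(0, 10, 19)), np.nan),
})
demo_filled, demo_strategy = impute_numeric(demo)
```

`demo_strategy` records which rule was applied to each column, which makes the choice easy to audit. Text columns such as `user_url` are left untouched by this helper.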

#### Check df.isna() again to make sure all NaN values are gone

In [20]:
df.isna().sum()

deets_author_id             0
deets_created_at            0
deets_tweet_id              0
deets_language              0
deets_possibly_sensitive    0
deets_text                  0
deets_retweet_count         0
deets_reply_count           0
deets_like_count            0
deets_quote_count           0
deets_text_length           0
user_created_at             0
user_description            0
user_id                     0
user_real_name              0
user_profile_image_url      0
user_protected              0
user_url                    0
username                    0
verified                    0
format                      0
user_followers_count        0
user_following_count        0
user_tweet_count            0
user_listed_count           0
user_description_length     0
dtype: int64

#### Use df.describe() to get an overall sense of the essential metrics for our dataset

In [21]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| deets_author_id | 5.0 | 417212900.0 | 702958600.0 | 783214.0 | 14130370.0 | 25073880.0 | 409486600.0 | 1636590000.0 |
| deets_tweet_id | 5.0 | 1.221676e+18 | 1544955000000000.0 | 1.219257e+18 | 1.221299e+18 | 1.221814e+18 | 1.222973e+18 | 1.223038e+18 |
| deets_retweet_count | 5.0 | 8375.2 | 11955.95 | 153.0 | 678.0 | 2406.0 | 10107.0 | 28532.0 |
| deets_reply_count | 5.0 | 1015.6 | 1247.624 | 56.0 | 119.0 | 508.0 | 1346.0 | 3049.0 |
| deets_like_count | 5.0 | 27367.2 | 32822.87 | 1068.0 | 5556.0 | 8517.0 | 44343.0 | 77352.0 |
| deets_quote_count | 5.0 | 967.8 | 1895.791 | 8.0 | 32.0 | 192.0 | 253.0 | 4354.0 |
| deets_text_length | 5.0 | 194.8 | 128.5193 | 23.0 | 91.0 | 273.0 | 284.0 | 303.0 |
| user_id | 5.0 | 417212900.0 | 702958600.0 | 783214.0 | 14130370.0 | 25073880.0 | 409486600.0 | 1636590000.0 |
| user_followers_count | 5.0 | 31530810.0 | 30854260.0 | 2636216.0 | 11677550.0 | 14331920.0 | 57135270.0 | 71873110.0 |
| user_following_count | 5.0 | 94.2 | 138.1655 | 1.0 | 18.0 | 47.0 | 68.0 | 337.0 |
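
In the summary above, the mean of `deets_retweet_count` (8375.2) is well above its median (2406.0), which points to a right-skewed distribution. A quick sketch of how we might quantify that and see the effect of a log transform, using the five retweet counts from `df.head()`; the `log1p` transform is our assumption about a reasonable next step, not something the notebook applies:

```python
import numpy as np
import pandas as pd

# The five retweet counts shown in df.head().
retweets = pd.Series([2406, 28532, 10107, 153, 678], name="deets_retweet_count")

summary = retweets.describe()
mean_to_median = summary["mean"] / summary["50%"]  # values well above 1 suggest right skew

# log1p compresses the long right tail and handles zero counts safely.
log_retweets = np.log1p(retweets)

raw_skew = retweets.skew()
log_skew = log_retweets.skew()
```

Comparing `raw_skew` with `log_skew` shows how much the transform reduces the asymmetry. With only five rows this is illustrative at best.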


___

## Check For Multicollinearity
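
One common way to check for multicollinearity is the variance inflation factor (VIF): for each feature, regress it on all the other features and compute 1 / (1 - R²). The sketch below computes VIF directly from that definition with NumPy, on synthetic data rather than the Twitter columns; `vif_table` and the `x1` ... `x_sum` columns are hypothetical names. The `variance_inflation_factor` helper imported in Setup is the library route for the same quantity.

```python
import numpy as np
import pandas as pd


def vif_table(features):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from an OLS regression
    (with intercept) of column j on all remaining columns."""
    X = features.to_numpy(dtype=float)
    n_rows = X.shape[0]
    vifs = {}
    for j, name in enumerate(features.columns):
        target = X[:, j]
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r_squared = 1.0 - residual.var() / target.var()
        vifs[name] = 1.0 / (1.0 - r_squared)
    return pd.Series(vifs, name="VIF")


# Synthetic data: x_sum is almost exactly x1 + x2, while x3 is independent.
rng = np.random.default_rng(0)
synthetic = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
    "x3": rng.normal(size=200),
})
synthetic["x_sum"] = synthetic["x1"] + synthetic["x2"] + rng.normal(scale=0.05, size=200)

vif = vif_table(synthetic)
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity. Here `x1`, `x2`, and `x_sum` land far above that range, while `x3` stays close to 1.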

___

## Check For Heteroskedasticity
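
Heteroskedasticity means the variance of the regression residuals changes with the predictors. One standard check is the Breusch-Pagan test. The sketch below implements Koenker's studentized variant (LM = n · R² from regressing the squared OLS residuals on the predictors) with NumPy and SciPy, on synthetic data; `breusch_pagan` is a hypothetical helper written for this illustration, not a function from this notebook.

```python
import numpy as np
from scipy import stats


def breusch_pagan(y, X):
    """Koenker's studentized Breusch-Pagan test.
    Returns (LM statistic, p-value); a small p-value suggests heteroskedasticity."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    design = np.column_stack([np.ones(len(y)), X])

    # Step 1: ordinary least squares fit and squared residuals.
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    squared_resid = (y - design @ beta) ** 2

    # Step 2: auxiliary regression of the squared residuals on the same predictors.
    gamma, *_ = np.linalg.lstsq(design, squared_resid, rcond=None)
    fitted = design @ gamma
    ss_res = ((squared_resid - fitted) ** 2).sum()
    ss_tot = ((squared_resid - squared_resid.mean()) ** 2).sum()
    r_squared = 1.0 - ss_res / ss_tot

    # Step 3: LM = n * R^2 follows a chi-square with one df per predictor.
    lm_stat = len(y) * r_squared
    p_value = stats.chi2.sf(lm_stat, df=X.shape[1])
    return lm_stat, p_value


# Synthetic data whose noise grows with x (a "fan" shape in the residuals).
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=500)
y_fan = 2.0 * x + rng.normal(size=500) * x

lm_fan, p_fan = breusch_pagan(y_fan, x)
```

For the fan-shaped data the p-value is very small, so the test rejects constant variance. A plot of residuals against fitted values is a useful visual companion to the test.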

___

## Explore