# Initial EDA Information

This notebook reports basic statistics on the scraped tweet dataset as a whole.

In [1]:
import pandas as pd
import numpy as np

# Read the .csv dataset that was combined from all scraped tweets.
df = pd.read_csv(r'..\init_scrape\combined_csv.csv')


In [2]:
# Quickly anonymize "User" column
df = df.assign(User=df.User.factorize()[0] + 1)
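
The anonymization step above relies on how factorize numbers its input. The sketch below uses a small toy frame with hypothetical handles (not values from the dataset) to show that labels are assigned in order of first appearance, and that a missing user would end up as 0 after the + 1 shift because factorize marks missing values with -1. The labels are only stable for a given row order, so this is a quick pseudonymization rather than a reproducible mapping.

```python
import pandas as pd

# Hypothetical handles, for illustration only; not taken from the dataset.
toy = pd.DataFrame({"User": ["alice", "bob", "alice", None, "carol"]})

# factorize numbers values in order of first appearance, starting at 0,
# and marks missing values with -1.
codes, uniques = toy["User"].factorize()
toy = toy.assign(User=codes + 1)

print(toy["User"].tolist())   # [1, 2, 1, 0, 3]
print(list(uniques))          # ['alice', 'bob', 'carol']
```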

## Quick Review of Dataset

In [3]:
# Count number of trending topics
trends = df.Trend.unique()
# unique() includes NaN for tweets with no trend label, so subtract 1 to exclude it
print('Number of trending tweet topics: ',len(trends)-1)

Number of trending tweet topics:  44
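
Subtracting 1 from the length of unique() only works when at least one trend label is missing. As a minimal sketch on a hypothetical toy column (the values and the toy_trends name are assumptions, not the real data), nunique() gives the same count without that adjustment, since it skips NaN by default, and isna().sum() reports how many rows lack a label.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the "Trend" column, with two missing labels.
toy_trends = pd.Series(["Bing", "Twix", np.nan, "Bing", "Ohio", np.nan], name="Trend")

# unique() keeps NaN as one of its values, so its length is off by one
# whenever at least one label is missing.
print(len(toy_trends.unique()))   # 4

# nunique() skips NaN by default (dropna=True).
n_topics = toy_trends.nunique()
print(n_topics)                   # 3

# isna().sum() shows how many rows have no trend label at all.
n_missing = int(toy_trends.isna().sum())
print(n_missing)                  # 2
```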


In [4]:
# Print the trending topics
print('\nThe trending topics our group scraped were: \n')
for trend in trends:
    print(trend)


The trending topics our group scraped were: 

Bing
Harry Potter
Judgement Day
New Ceo
State of the Union
Super Bowl
Tim Kelly
Turkey
Twix
Wrexham
Arkansas
Club Renaissance
Elden Ring
Gen Z
LakeShow
Mitch McConnell
Most Americans
Russians
nan
Taibbi
The President
UFOs
Alaska
Barcelona
Elizabeth Olsen
Europa League
Fulton County
Nintendo Direct
PowerBall
South Park
The Last of US
Xavi
All Lies
Creepy Joe
dementia
Fox News
Hogwarts Legacy
Leonardo DiCaprio
melanoma
Ohio
Rep George Santos
Sinema
Spartans
Switch 2
The Bible
 

In [5]:
# Find the size of the dataset
print("Number of tweets in dataset: ", len(df))

Number of tweets in dataset:  21348


## Data Cleaning

The following cell cleans the coordinate string stored in the GeoJSON column by stripping the "Coordinates(longitude=" and "latitude=" labels, leaving a string that reads like a (longitude, latitude) tuple.

In [6]:
%%capture
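# Clean the GeoJSON formatting so that only a "(longitude, latitude)" string remains.
# regex=False treats the pattern literally, so the "(" needs no escaping.
df['GeoJSON'] = df['GeoJSON'].str.replace('Coordinates(longitude=', '(', regex=False)
df['GeoJSON'] = df['GeoJSON'].str.replace('latitude=', '', regex=False)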
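
After this cleaning step the column still holds strings, not Python tuples. If numeric coordinates are needed later, one possible follow-up is sketched below. The parse_coordinates helper and the sample value are hypothetical, and the sketch assumes the cleaned strings look like "(longitude, latitude)". It could be applied with df['GeoJSON'].map(parse_coordinates).

```python
import ast
import math

def parse_coordinates(text):
    """Turn a cleaned '(lon, lat)' string into a (lon, lat) tuple of floats.

    Returns None for missing values (None or float NaN).
    """
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return None
    # literal_eval safely evaluates the tuple literal without running code.
    lon, lat = ast.literal_eval(text)
    return (float(lon), float(lat))

# Hypothetical cleaned value, for illustration only.
sample = "(-73.9857, 40.7484)"
print(parse_coordinates(sample))        # (-73.9857, 40.7484)
print(parse_coordinates(float("nan")))  # None
```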

## Data Analysis

The following cells explore the data. We first compare the number of distinct users in the dataset with the number of Tweets.

In [7]:
# Find the number of tweets per user and turn the counts into a two-column df
df_Tweets_Per_User = df.User.value_counts().reset_index()
df_Tweets_Per_User.columns = ['User', 'Number of Tweets'] # change column names
df_Tweets_Per_User

| | User | Number of Tweets |
|---|---|---|
| 0 | 16826 | 128 |
| 1 | 10715 | 57 |
| 2 | 8931 | 54 |
| 3 | 4558 | 50 |
| 4 | 4513 | 45 |
| ... | ... | ... |
| 17626 | 6193 | 1 |
| 17627 | 6194 | 1 |
| 17628 | 6195 | 1 |
| 17629 | 6196 | 1 |


The above output shows that the 21,348 Tweets came from only 17,631 Users, so some users appear more than once. The most active user posted 128 Tweets, and several others posted 45 or more.
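
The comparison of users and Tweets can also be summarized numerically. With the figures quoted above, 21,348 Tweets over 17,631 Users is roughly 1.21 Tweets per user. The sketch below shows one way such summaries might be computed; the toy frame and its values are hypothetical, not drawn from the dataset.

```python
import pandas as pd

# Hypothetical anonymized user ids, one entry per tweet.
toy = pd.DataFrame({"User": [1, 1, 1, 2, 3, 3, 4]})

n_tweets = len(toy)
n_users = toy["User"].nunique()
tweets_per_user = toy["User"].value_counts()

print(n_tweets, n_users)               # 7 4
print(round(n_tweets / n_users, 2))    # 1.75
# Share of users who appear more than once.
print((tweets_per_user > 1).mean())    # 0.5
```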

The following cell demonstrates how many tweets featured GeoJSON data.

In [8]:
# Create a new df of just Trend, User and GeoJSON data
Geo_Tweets = df[['Trend','User','GeoJSON']].copy()
Geo_Tweets = Geo_Tweets.dropna(subset=['GeoJSON'])
# Count the number of GeoJSON entries. 
print("Number of tweets featuring location coordinates: ",len(Geo_Tweets))
# Percentage of GeoJSON Tweets
print("Percentage of GeoJSON tweets in dataset: ","{0:.1%}".format(len(Geo_Tweets)/len(df)))

Number of tweets featuring location coordinates:  370
Percentage of GeoJSON tweets in dataset:  1.7%


After finding how many Tweets featured GeoJSON data, we wanted to know how many coordinates were repeated. The following cell counts how often each coordinate appears.

In [9]:
# Count how often each coordinate appears and turn the counts into a two-column df
df_Geo_Counts = Geo_Tweets.GeoJSON.value_counts().reset_index()
df_Geo_Counts.columns = ['Coordinates', 'Count'] # change column names

# Quickly anonymize "Coordinates" column
df_Geo_Counts = df_Geo_Counts.assign(Coordinates = df_Geo_Counts.Coordinates.factorize()[0] + 1)

# View 
df_Geo_Counts.head(50)

| | Coordinates | Count |
|---|---|---|
| 0 | 1 | 13 |
| 1 | 2 | 11 |
| 2 | 3 | 7 |
| 3 | 4 | 6 |
| 4 | 5 | 5 |
| 5 | 6 | 5 |
| 6 | 7 | 4 |
| 7 | 8 | 4 |
| 8 | 9 | 4 |
| 9 | 10 | 4 |


The above output shows that at least 50 distinct coordinates were posted two or more times; the most frequent one appears 13 times.
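
Reading the number of repeats off a head(50) view gives only a lower bound. As a minimal sketch, an exact count can be taken from the value counts directly. The count_repeated helper and the toy coordinate strings are hypothetical; the same helper would work on a column of user ids.

```python
import pandas as pd

def count_repeated(values, min_count=2):
    """Number of distinct non-missing values occurring at least min_count times."""
    counts = values.value_counts()  # NaN is dropped by default
    return int((counts >= min_count).sum())

# Hypothetical coordinate strings, for illustration only.
toy_geo = pd.Series(["(1.0, 2.0)", "(1.0, 2.0)", "(3.0, 4.0)",
                     "(5.0, 6.0)", "(5.0, 6.0)", "(5.0, 6.0)", None])
print(count_repeated(toy_geo))      # 2
print(count_repeated(toy_geo, 3))   # 1
```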

The following cells count how many users posted more than one Tweet with GeoJSON coordinates.

In [10]:
%%capture
# Create a dataframe of only 'User' and 'GeoJSON' columns.
User_and_Geo = Geo_Tweets[['User','GeoJSON']].copy()
# Tweets per user are counted in the next cell with value_counts().


In [11]:
# Count GeoJSON tweets per user and turn the counts into a two-column df
df_Geo_Per_User = User_and_Geo.User.value_counts().reset_index()
df_Geo_Per_User.columns = ['User', 'Number of Geo Tweets'] # change column names
df_Geo_Per_User.head(30)

| | User | Number of Geo Tweets |
|---|---|---|
| 0 | 2730 | 14 |
| 1 | 6077 | 7 |
| 2 | 3841 | 6 |
| 3 | 12313 | 3 |
| 4 | 3658 | 3 |
| 5 | 6590 | 3 |
| 6 | 9788 | 3 |
| 7 | 3734 | 2 |
| 8 | 16503 | 2 |
| 9 | 2780 | 2 |


The output above shows that almost 30 Users Tweeted GeoJSON coordinates at least twice; the most active of them posted 14 such Tweets.