# COGS 108 - Data Checkpoint

# Names

- Quinton Ramaswamy
- Lauren Gomez
- Michael Wei
- David Marquez
- Alexander G. Arias

<a id='research_question'></a>
# Research Question

**Research Question** <br>
`Version 1` <br>
How do the number, favorite count, and retweet count of NBA players' political tweets affect their salary, on-court performance, and impact on the team during the 2020-2021 season? <br>
`Version 2` <br>
How do the quantity and impact of NBA players' political tweets affect their salary, on-court performance, and impact on the team during the 2020-2021 season?

# Dataset(s)

- Dataset Name: main_data.csv
- Link to the dataset: https://www.basketball-reference.com/players/
- Number of observations: 2818

Player stats for the 2017-18 through 2020-21 seasons and player salaries for the 2017-18 through 2019-20 seasons. All data was taken from player pages on basketball-reference.com. Salary values for the 2020-21 rows are null.

- Dataset Name: salaries_2021.csv
- Link to the dataset: https://www.basketball-reference.com/contracts/players.html
- Number of observations: 563

Player names and corresponding salaries for the current NBA season, taken from the table of current NBA contracts.

- Dataset Name: player_tweets (awaiting license to be approved)
- Link to the dataset: https://twitter.com/
- Number of observations: about 1200 \* 321 (number of tweets * number of current NBA players on Twitter)

We have applied for 4 Twitter developer licenses to scrape players' tweets through the Twitter API and expect the licenses to be approved by next week. The dataset will contain the follower count of each current NBA player on Twitter and the last 1200 tweets of each player from 2021, with attributes such as the text of the tweet, the number of retweets, and the favorite count.

# Setup

Our first step was to install the basketball-reference-scraper package with `pip install basketball-reference-scraper`.

We used its API (https://github.com/vishaalagartha/basketball_reference_scraper) to obtain our original dataset.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import datetime

In [2]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer 
nltk.download('punkt')
nltk.download('vader_lexicon')
nltk.download('words')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/c4wei/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/c4wei/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package words to /home/c4wei/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package stopwords to /home/c4wei/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
main_data = pd.read_csv('data/main_data.csv', index_col=0)
main_data

Unnamed: 0,SEASON,AGE,TEAM,LEAGUE,POS,G,GS,MP,FG,FGA,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,SALARY,NAME
7,2017-18,28.0,TOT,NBA,SF,50.0,14.0,14.3,1.8,4.3,...,1.8,1.9,0.6,0.2,0.1,0.4,1.1,5.2,"$1,974,159",Luke Babbitt
8,2017-18,28.0,ATL,NBA,SF,37.0,9.0,15.4,2.2,4.5,...,2.1,2.2,0.7,0.2,0.1,0.4,1.3,6.1,"$1,974,159",Luke Babbitt
9,2017-18,28.0,MIA,NBA,SF,13.0,5.0,11.2,0.8,3.6,...,1.0,1.2,0.4,0.1,0.2,0.3,0.5,2.5,"$1,974,159",Luke Babbitt
7,2017-18,28.0,ATL,NBA,SG,65.0,65.0,27.5,4.4,10.4,...,3.4,3.8,3.5,1.5,0.7,2.4,2.3,12.9,"$16,910,113",Kent Bazemore
8,2018-19,29.0,ATL,NBA,SG,67.0,35.0,24.5,4.1,10.3,...,3.3,3.9,2.3,1.3,0.6,1.8,2.5,11.6,"$18,089,887",Kent Bazemore
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7,2018-19,25.0,CHI,NBA,SF,15.0,15.0,32.8,6.5,13.4,...,4.6,5.5,2.7,1.2,0.6,1.7,1.9,17.5,"$26,011,913",Jontay Porter
8,2019-20,26.0,CHI,NBA,SF,14.0,9.0,23.6,4.4,10.0,...,2.5,3.4,1.8,1.1,0.4,0.8,2.2,11.9,"$27,250,576",Jontay Porter
9,2020-21,27.0,TOT,NBA,"PF,SF",28.0,6.0,21.6,3.5,8.1,...,4.4,5.4,2.0,0.6,0.1,0.9,1.5,9.7,,Jontay Porter
10,2020-21,27.0,CHI,NBA,SF,25.0,6.0,21.6,3.6,8.1,...,4.4,5.5,2.0,0.5,0.2,1.0,1.5,9.9,,Jontay Porter


In [4]:
salaries_2021 = pd.read_csv('data/salaries_2021.csv', index_col=0)
salaries_2021

Unnamed: 0,Rk,Player,Tm,2020-21,2021-22,2022-23,2023-24,2024-25,2025-26,Signed Using,Guaranteed
0,1,Stephen Curry,GSW,"$43,006,362","$45,780,966",,,,,Bird Rights,"$88,787,328"
1,2,Russell Westbrook,WAS,"$41,358,814","$44,211,146","$47,063,478",,,,Bird Rights,"$85,569,960"
2,3,Chris Paul,PHO,"$41,358,814","$44,211,146",,,,,Bird Rights,"$41,358,814"
3,4,John Wall,HOU,"$41,254,920","$44,310,840","$47,366,760",,,,Bird Rights,"$85,565,760"
4,5,James Harden,BRK,"$40,824,000","$43,848,000","$46,872,000",,,,Bird Rights,"$84,672,000"
...,...,...,...,...,...,...,...,...,...,...,...
558,509,Troy Williams,HOU,"$122,741","$122,741","$122,741",,,,,"$368,223"
559,510,Freddie Gillespie,TOR,"$116,903","$1,517,981",,,,,Minimum Salary,"$116,903"
560,511,Demetrius Jackson,BOS,"$92,857","$92,857","$92,857","$92,857",,,,"$371,428"
561,512,Chimezie Metu,SAC,"$700,000","$1,762,796","$1,910,860",,,,,"$750,000"


# Data Cleaning

## Player on-court performance and salary
Combine each player's performance stats with their salary and clean the data.

In [5]:
# remove players with 5 or fewer games played in a season
# column 'G' mixes strings, ints, and floats, so coerce it to numeric before comparing
games_played = pd.to_numeric(main_data['G'], errors='coerce')
main_data = main_data[~games_played.isin([1, 2, 3, 4, 5])]

main_data.reset_index(drop=True, inplace=True)

main_data

Unnamed: 0,SEASON,AGE,TEAM,LEAGUE,POS,G,GS,MP,FG,FGA,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,SALARY,NAME
0,2017-18,28.0,TOT,NBA,SF,50.0,14.0,14.3,1.8,4.3,...,1.8,1.9,0.6,0.2,0.1,0.4,1.1,5.2,"$1,974,159",Luke Babbitt
1,2017-18,28.0,ATL,NBA,SF,37.0,9.0,15.4,2.2,4.5,...,2.1,2.2,0.7,0.2,0.1,0.4,1.3,6.1,"$1,974,159",Luke Babbitt
2,2017-18,28.0,MIA,NBA,SF,13.0,5.0,11.2,0.8,3.6,...,1.0,1.2,0.4,0.1,0.2,0.3,0.5,2.5,"$1,974,159",Luke Babbitt
3,2017-18,28.0,ATL,NBA,SG,65.0,65.0,27.5,4.4,10.4,...,3.4,3.8,3.5,1.5,0.7,2.4,2.3,12.9,"$16,910,113",Kent Bazemore
4,2018-19,29.0,ATL,NBA,SG,67.0,35.0,24.5,4.1,10.3,...,3.3,3.9,2.3,1.3,0.6,1.8,2.5,11.6,"$18,089,887",Kent Bazemore
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2494,2018-19,25.0,WAS,NBA,SF,41.0,28.0,29.0,4.9,10.8,...,4.7,5.6,2.0,1.6,0.5,1.0,1.9,12.6,"$26,011,913",Jontay Porter
2495,2018-19,25.0,CHI,NBA,SF,15.0,15.0,32.8,6.5,13.4,...,4.6,5.5,2.7,1.2,0.6,1.7,1.9,17.5,"$26,011,913",Jontay Porter
2496,2019-20,26.0,CHI,NBA,SF,14.0,9.0,23.6,4.4,10.0,...,2.5,3.4,1.8,1.1,0.4,0.8,2.2,11.9,"$27,250,576",Jontay Porter
2497,2020-21,27.0,TOT,NBA,"PF,SF",28.0,6.0,21.6,3.5,8.1,...,4.4,5.4,2.0,0.6,0.1,0.9,1.5,9.7,,Jontay Porter


In [6]:
# find columns that contain null values
null_columns = main_data.columns[main_data.isnull().any()]
null_columns

Index(['POS', '3P%', '2P%', 'FT%', 'SALARY'], dtype='object')

In [7]:
# a percentage is null when a player had no attempts of that type (e.g. never shot a free throw), so fill null percentages with 0
main_data['FT%'] = main_data['FT%'].fillna(0)
main_data['2P%'] = main_data['2P%'].fillna(0)
main_data['3P%'] = main_data['3P%'].fillna(0)

In [8]:
main_data[main_data['POS'].isnull()]

Unnamed: 0,SEASON,AGE,TEAM,LEAGUE,POS,G,GS,MP,FG,FGA,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,SALARY,NAME
982,2020-21,26.0,TOT,NBA,,15.0,1.0,6.3,0.5,1.7,...,0.3,0.5,0.3,0.5,0.1,0.3,0.3,1.1,,Sindarius Thornwell


In [9]:
# remove player Sindarius Thornwell, who gets very few minutes and has a null position
main_data = main_data.drop(labels = 982, axis = 0)
main_data.reset_index(drop=True, inplace=True)

In [10]:
# drop rows that repeat the column names, which were scraped along with the table from the origin site
salaries_2021.drop(salaries_2021[salaries_2021['2021-22'] == 'Salary'].index,inplace=True)
salaries_2021.drop(salaries_2021[salaries_2021['Rk'] == 'Rk'].index,inplace=True)

salaries_2021_only = salaries_2021.filter(['Player', '2020-21'])
salaries_2021_only.rename(columns={'2020-21' : 'SALARY', 'Player' : 'NAME'}, inplace=True)
salaries_2021_only.reset_index(drop=True, inplace=True)

In [11]:
# splitting data by season
df_2017 = main_data[main_data['SEASON'] == '2017-18']
df_2018 = main_data[main_data['SEASON'] == '2018-19']
df_2019 = main_data[main_data['SEASON'] == '2019-20']
df_2020 = main_data[main_data['SEASON'] == '2020-21']

df_2017.reset_index(drop=True,inplace=True)
df_2018.reset_index(drop=True,inplace=True)
df_2019.reset_index(drop=True,inplace=True)
df_2020.reset_index(drop=True,inplace=True)

df_2017 = df_2017.drop_duplicates(subset='NAME', keep='first')
df_2018 = df_2018.drop_duplicates(subset='NAME', keep='first')
df_2019 = df_2019.drop_duplicates(subset='NAME', keep='first')
df_2020 = df_2020.drop_duplicates(subset='NAME', keep='first')
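
For players traded mid-season, basketball-reference lists a combined `TOT` row alongside one row per team, as in the Luke Babbitt rows shown earlier. Below is a minimal sketch of what `drop_duplicates(subset='NAME', keep='first')` retains in that case, assuming the `TOT` row comes first within a season. The toy frame copies a few values from those rows and is not the full dataset.

```python
import pandas as pd

# Toy frame: Luke Babbitt's three 2017-18 rows as shown in the main_data preview
toy = pd.DataFrame({
    'NAME': ['Luke Babbitt', 'Luke Babbitt', 'Luke Babbitt'],
    'SEASON': ['2017-18', '2017-18', '2017-18'],
    'TEAM': ['TOT', 'ATL', 'MIA'],
    'G': [50.0, 37.0, 13.0],
    'PTS': [5.2, 6.1, 2.5],
})

# keep='first' keeps only the first row per player, here the season total
deduped = toy.drop_duplicates(subset='NAME', keep='first')
print(deduped)
```

If the `TOT` row did not come first for some player, the kept row would be a single-team line instead, so the row order is worth checking before relying on this step.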

In [12]:
# combining salaries for season 2020-21
df_2020 = df_2020.drop(columns='SALARY')
df_2020 = df_2020.merge(salaries_2021_only, how='left', on='NAME')

In [13]:
# players whose names contain accents or otherwise do not work with the scraper; their salaries will need to be added manually
df_2020[df_2020.isnull().any(axis=1)]

Unnamed: 0,SEASON,AGE,TEAM,LEAGUE,POS,G,GS,MP,FG,FGA,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,NAME,SALARY
5,2020-21,33.0,UTA,NBA,PF,15.0,1.0,9.1,1.3,3.1,...,1.4,1.7,0.3,0.7,0.2,0.5,1.8,4.0,Ersan İlyasova,
9,2020-21,27.0,LAL,NBA,SG,59.0,59.0,32.2,5.5,12.5,...,3.0,3.5,5.8,1.2,0.2,2.7,2.6,15.5,Dennis Schröder,
28,2020-21,28.0,CHI,NBA,C,14.0,0.0,4.2,0.3,0.5,...,0.6,1.0,0.6,0.1,0.0,0.3,0.5,1.0,Cristiano Felício,
41,2020-21,26.0,NOP,NBA,C,41.0,6.0,17.0,2.9,5.1,...,4.2,6.6,0.9,0.4,0.5,0.6,1.3,7.0,Willy Hernangómez,
67,2020-21,25.0,MIN,NBA,PF,47.0,6.0,17.4,2.6,5.8,...,3.1,4.0,0.6,0.4,0.1,0.4,0.9,7.6,Juan Hernangómez,
68,2020-21,25.0,DEN,NBA,C,66.0,66.0,35.0,10.2,18.1,...,8.1,10.9,8.4,1.4,0.7,3.1,2.7,26.4,Nikola Jokić,
86,2020-21,32.0,DAL,NBA,C,28.0,3.0,9.0,2.0,4.0,...,2.8,4.4,0.3,0.1,0.2,0.4,1.0,5.1,Boban Marjanović,
88,2020-21,35.0,PHI,NBA,PF,7.0,0.0,9.6,0.1,1.4,...,0.7,1.1,0.3,0.4,0.3,0.4,0.9,1.1,Anthony Tolliver,
105,2020-21,31.0,UTA,NBA,SF,66.0,66.0,30.6,5.4,12.4,...,3.3,3.8,1.9,0.6,0.1,1.7,1.6,16.3,Bojan Bogdanović,
136,2020-21,34.0,MIA,NBA,PG,45.0,11.0,26.7,4.7,10.8,...,3.0,3.5,4.5,0.7,0.2,2.4,2.2,13.1,Goran Dragić,
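
Several of the unmatched names above contain diacritics. One possible way to reduce the manual work is to join on an accent-stripped key. This sketch assumes the salary table spells the same names without accents, which has not been checked against salaries_2021.csv. `strip_accents` and `salary_by_key` are hypothetical names introduced for illustration.

```python
import unicodedata

def strip_accents(name):
    """Return name with diacritics removed, e.g. for use as a join key."""
    # NFKD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFKD', name)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

# Names as they appear in the stats table above
stats_names = ['Nikola Jokić', 'Dennis Schröder', 'Ersan İlyasova']

# Hypothetical salary lookup keyed by unaccented names
# (salary strings copied from the manual fixes made later in this notebook)
salary_by_key = {
    'Nikola Jokic': '$28,542,009',
    'Dennis Schroder': '$15,500,000',
    'Ersan Ilyasova': '$1,194,542',
}

matched = {name: salary_by_key.get(strip_accents(name)) for name in stats_names}
print(matched)
```

Names that fail to match for other reasons (Anthony Tolliver has no accent, for example) would still need to be handled by hand.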


In [14]:
df_2017 = df_2017.drop(labels=[647, 648, 649])
df_2018 = df_2018.drop(labels=[300, 361, 567, 694])

In [15]:
df_2017.reset_index(drop=True, inplace=True)
df_2018.reset_index(drop=True, inplace=True)

In [16]:
df_2019.loc[16, 'SALARY'] = '$893,310'
df_2018.loc[89, 'SALARY'] = '$838,464'
df_2020.loc[5, 'SALARY'] = '$1,194,542'
df_2020.loc[9, 'SALARY'] = '$15,500,000'
df_2020.loc[28, 'SALARY'] = '$7,529,020'
df_2020.loc[41, 'SALARY'] = '$1,727,145'
df_2020.loc[67, 'SALARY'] = '$6,493,000'
df_2020.loc[68, 'SALARY'] = '$28,542,009'
df_2020.loc[86, 'SALARY'] = '$3,500,000'
df_2020.loc[88, 'SALARY'] = '$893,310'
df_2020.loc[105, 'SALARY'] = '$17,850,000'
df_2020.loc[136, 'SALARY'] = '$18,000,000'
df_2020.loc[171, 'SALARY'] = '$893,310'
df_2020.loc[180, 'SALARY'] = '$29,467,800'
df_2020.loc[183, 'SALARY'] = '$893,310'
df_2020.loc[200, 'SALARY'] = '$26,000,000'
df_2020.loc[206, 'SALARY'] = '$1,824,003'
df_2020.loc[209, 'SALARY'] = '$8,333,333'
df_2020.loc[225, 'SALARY'] = '$12,000,000'
df_2020.loc[226, 'SALARY'] = '$18,000,000'
df_2020.loc[237, 'SALARY'] = '$15,000,000'
df_2020.loc[253, 'SALARY'] = '$15,000,000'
df_2020.loc[269, 'SALARY'] = '$10,000,000'
df_2020.loc[272, 'SALARY'] = '$893,310'
df_2020.loc[278, 'SALARY'] = '$893,310'
df_2020.loc[283, 'SALARY'] = '$893,310'
df_2020.loc[285, 'SALARY'] = '$8,049,360'
df_2020.loc[290, 'SALARY'] = '$893,310'
df_2020.loc[293, 'SALARY'] = '$893,310'
df_2020.loc[305, 'SALARY'] = '$250,000'
df_2020.loc[309, 'SALARY'] = '$893,310'
df_2020.loc[326, 'SALARY'] = '$893,310'
df_2020.loc[332, 'SALARY'] = '$893,310'
df_2020.loc[336, 'SALARY'] = '$893,310'
df_2020.loc[339, 'SALARY'] = '$893,310'
df_2020.loc[341, 'SALARY'] = '$893,310'
df_2020.loc[342, 'SALARY'] = '$893,310'
df_2020.loc[348, 'SALARY'] = '$893,310'
df_2020.loc[349, 'SALARY'] = '$893,310'
df_2020.loc[353, 'SALARY'] = '$893,310'
df_2020.loc[354, 'SALARY'] = '$893,310'
df_2020.loc[359, 'SALARY'] = '$1,517,981'
df_2020.loc[360, 'SALARY'] = '$893,310'
df_2020.loc[362, 'SALARY'] = '$893,310'
df_2020.loc[363, 'SALARY'] = '$893,310'
df_2020.loc[366, 'SALARY'] = '$893,310'
df_2020.loc[375, 'SALARY'] = '$893,310'
df_2020.loc[378, 'SALARY'] = '$4,097,561'
df_2020.loc[383, 'SALARY'] = '$893,310'
df_2020.loc[394, 'SALARY'] = '$893,310'
df_2020.loc[399, 'SALARY'] = '$893,310'
df_2020.loc[401, 'SALARY'] = '$893,310'
df_2020.loc[405, 'SALARY'] = '$2,824,320'
df_2020.loc[406, 'SALARY'] = '$893,310'
df_2020.loc[411, 'SALARY'] = '$893,310'
df_2020.loc[412, 'SALARY'] = '$893,310'
df_2020.loc[416, 'SALARY'] = '$893,310'
df_2020.loc[417, 'SALARY'] = '$893,310'

In [17]:
def standardize_salary(input_str):
    # convert a salary string such as '$1,974,159' to an int
    try:
        string = input_str.strip()
    except AttributeError:
        # non-string values (e.g. NaN for a missing salary) are returned unchanged
        return input_str
    # remove currency formatting and the '<' and '(TW)' markers
    for marker in ('$', ',', '<', '(TW)'):
        string = string.replace(marker, '')
    # a salary listed as 'Minimum' is replaced with 850000
    string = string.strip().replace('Minimum', '850000')
    return int(string)

In [18]:
df_2017['SALARY'] = df_2017['SALARY'].apply(standardize_salary)
df_2018['SALARY'] = df_2018['SALARY'].apply(standardize_salary)
df_2019['SALARY'] = df_2019['SALARY'].apply(standardize_salary)
df_2020['SALARY'] = df_2020['SALARY'].apply(standardize_salary)

In [19]:
def del_strs(x):
    # convert a value to float; anything that is not numeric becomes NaN
    try:
        return float(x)
    except (TypeError, ValueError):
        return np.nan

In [20]:
df_2017['PTS'] = df_2017['PTS'].apply(del_strs)
df_2018['PTS'] = df_2018['PTS'].apply(del_strs)
df_2019['PTS'] = df_2019['PTS'].apply(del_strs)
df_2020['PTS'] = df_2020['PTS'].apply(del_strs)

In [21]:
df_2017.dropna(axis=0, subset=['PTS'], inplace=True, how='any')
df_2018.dropna(axis=0, subset=['PTS'], inplace=True, how='any')
df_2019.dropna(axis=0, subset=['PTS'], inplace=True, how='any')
df_2020.dropna(axis=0, subset=['PTS'], inplace=True, how='any')

In [22]:
df_2017.reset_index(drop=True,inplace=True)
df_2018.reset_index(drop=True,inplace=True)
df_2019.reset_index(drop=True,inplace=True)
df_2020.reset_index(drop=True,inplace=True)
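
The coercion and row-dropping in the cells above can also be expressed with pandas' built-in `pd.to_numeric`. Below is a minimal sketch on a toy frame. The player names, values, and the stray 'PTS' header string are made up for illustration.

```python
import pandas as pd

# Toy frame with a numeric string, a stray header string, and a float
toy = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C'],
    'PTS': ['5.2', 'PTS', 12.9],
})

# errors='coerce' turns anything that cannot be parsed as a number into NaN
toy['PTS'] = pd.to_numeric(toy['PTS'], errors='coerce')

# drop the rows whose PTS could not be parsed, then renumber the index
toy = toy.dropna(subset=['PTS']).reset_index(drop=True)
print(toy)
```

This does the same job as `del_strs` followed by `dropna` and `reset_index`, applied to a whole column at once.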

In [23]:
df_2017

Unnamed: 0,SEASON,AGE,TEAM,LEAGUE,POS,G,GS,MP,FG,FGA,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,SALARY,NAME
0,2017-18,28.0,TOT,NBA,SF,50.0,14.0,14.3,1.8,4.3,...,1.8,1.9,0.6,0.2,0.1,0.4,1.1,5.2,1974159,Luke Babbitt
1,2017-18,28.0,ATL,NBA,SG,65.0,65.0,27.5,4.4,10.4,...,3.4,3.8,3.5,1.5,0.7,2.4,2.3,12.9,16910113,Kent Bazemore
2,2017-18,31.0,TOT,NBA,SG,80.0,2.0,24.3,4.2,9.6,...,1.6,1.9,1.9,0.8,0.2,1.2,1.1,12.1,6306060,Marco Belinelli
3,2017-18,23.0,ATL,NBA,SF,26.0,3.0,17.5,2.0,4.9,...,2.5,2.8,1.9,0.8,0.5,1.8,1.5,5.2,1567200,DeAndre' Bembry
4,2017-18,23.0,ATL,NBA,PF,39.0,1.0,13.3,1.7,3.9,...,2.1,3.3,0.7,0.2,0.1,0.4,1.6,4.7,729919,Tyler Cavanaugh
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
459,2017-18,29.0,WAS,NBA,PF,76.0,1.0,18.5,3.6,6.9,...,2.6,3.3,1.1,0.3,0.1,1.0,1.9,8.8,1471382,Mike Scott
460,2017-18,31.0,WAS,NBA,C,33,2,8.6,1.4,3.5,...,1.1,1.6,0.4,0.1,0.4,0.5,1.3,3.4,5225000,Jason Smith
461,2017-18,27.0,WAS,NBA,PG,41,41,34.4,6.8,16.3,...,3.1,3.7,9.6,1.4,1.1,3.9,2.0,19.4,18063850,John Wall
462,2017-18,25.0,NYK,NBA,SG,57.0,54.0,33.1,6.3,15.0,...,3.4,3.9,2.7,1.1,0.2,1.6,2.2,17.5,16500000,Tim Hardaway Jr.


In [24]:
# exporting csv's by season
df_2017.to_csv('data/year_2017.csv')
df_2018.to_csv('data/year_2018.csv')
df_2019.to_csv('data/year_2019.csv')
df_2020.to_csv('data/year_2020.csv')

## Player Tweets
We will build a dictionary that assigns each word a score from 0 to 1, with 1 marking the strongest political connotation. Every tweet will be stemmed and stripped of punctuation and stop words. We then add up the weights of the words in the tweet to get its overall political score, and we label the tweet as 1 if that score is greater than 0.

This data will be stored as a dataframe with one row per current NBA player and columns for the number of political posts the player made, the proportion of political posts among all posts, the average number of retweets and favorites per post, and the player's follower count.
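
The weighted scoring described above could look roughly like the following. The weights are placeholders chosen for illustration, since the actual dictionary has not been built yet, and `political_weights`, `political_score`, and `is_political` are hypothetical names. Tokens are assumed to be already stemmed and filtered.

```python
# Hypothetical weights on stemmed words, from 0 (not political) to 1 (strongly political)
political_weights = {
    'blacklivesmatt': 1.0,
    'justic': 0.5,
    'right': 0.25,
}

def political_score(tokens, weights=political_weights):
    """Sum the weights of the tokens in one tweet; unknown tokens count as 0."""
    return sum(weights.get(token, 0.0) for token in tokens)

def is_political(tokens, weights=political_weights):
    """Label a tweet 1 if its overall political score is greater than 0, else 0."""
    return int(political_score(tokens, weights) > 0)

# Two toy tweets, already tokenized and stemmed
tweets = [
    ['justic', 'blacklivesmatt', 'today'],
    ['love', 'game', 'tonight'],
]
scores = [political_score(t) for t in tweets]
labels = [is_political(t) for t in tweets]
print(scores, labels)
```

A low weight on an ambiguous stem such as 'right' may help separate tweets that only use the word in a non-political sense (for example "right now") from tweets that contain strongly political terms.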

In [25]:
twitter_df1 = pd.read_pickle('data/nba_twitter_data_1.pkl')
twitter_df2 = pd.read_pickle('data/nba_twitter_data_2.pkl')

In [26]:
twitter_df = pd.concat([twitter_df1, twitter_df2])

In [27]:
political_words_list = ['justice', 'political', 'blacklivesmatter', 'alllivesmatter', 'whitelivesmatter', 'bluelivesmatter', 'accountability', 'leftwing', 'rightwing', 'extremist', 'police-brutality', 'policereform',
'george-perry-floyd', 'minneapolis', 'derek-chauvin', 'thomas-lane', 'j-alexander-kueng', 'tou-thao', 'breonna-taylor', 'brett-hankison', 'myles-cosgrove', 'joshua-jaynes', 'elijahmcclain', 'nathanwoodyard', 'jasonrosenblatt',
'randyroedema', 'ahmaudarbery', 'gregorymcmichael', 'travismcmichael', 'williambryan', 'trayvon-martin', 'george-zimmerman', 'patriot', 'icantbreathe', 'saytheirname', 'sayhername', 'sayhisname', 'equality', 'equity',
'discrimination', 'shooting', 'killing', 'murder', 'assassination', 'mental-health', 'racial-profiling', 'stereotypes', 'racist', 'arrest', 'warrant', 'charge', 'unfair', 'trump', 'biden', 'america', 'amerikkka',
'united-states', 'agenda', 'privilege', 'immunity', 'malfeasance', 'irresponsible', 'integrity', 'infringement', 'civil', 'rights', 'violation', 'impunity', 'illegal', 'expose', 'dictatorship', 'authoritarian',
'corruption', 'attack', 'allegation', 'law', 'illegal', 'congress', 'courts', 'trial', 'progressive', 'liberal', 'conservative', 'democrat', 'republican', 'democracy', 'snowflake', 'bootlicker', 'activism', 'protest', 'country', 
'social-justice', 'partisan', 'ideology', 'political', 'oppression', 'leadership', 'anarchy', 'conflict', 'establishment', 'derekchauvin', 'thomas-lane', 'guilty', 'innocent', 'debate', 'coronavirus', 'corona', 'covid',
'covid-19', 'chinese-virus', 'kung-flu', 'pandemic', 'hoax', 'covidiot', 'quarantine', 'mask', 'social-distance', 'lie', 'freedom', 'media', 'press', 'juneteenth', 'unconstitutional', 'empower', 'revolution', 'fascist', 'police',
'cops', 'pigs', 'force', 'terrorism', 'platform', 'power', 'crime', 'criminal-justice-system', 'communist', 'capitalist', 'liberties', 'abuse', 'poc', 'polarization', 'protect', 'violence', 'police', 'brutality', 'breonna']

In [28]:
political_stems = []
stemmer = SnowballStemmer('english')
for word in political_words_list:
    political_stems.append(stemmer.stem(word))

In [29]:
political_stems

['justic',
 'polit',
 'blacklivesmatt',
 'alllivesmatt',
 'whitelivesmatt',
 'bluelivesmatt',
 'account',
 'leftw',
 'rightw',
 'extremist',
 'police-brut',
 'policereform',
 'george-perry-floyd',
 'minneapoli',
 'derek-chauvin',
 'thomas-lan',
 'j-alexander-kueng',
 'tou-thao',
 'breonna-taylor',
 'brett-hankison',
 'myles-cosgrov',
 'joshua-jayn',
 'elijahmcclain',
 'nathanwoodyard',
 'jasonrosenblatt',
 'randyroedema',
 'ahmaudarberi',
 'gregorymcmichael',
 'travismcmichael',
 'williambryan',
 'trayvon-martin',
 'george-zimmerman',
 'patriot',
 'icantbreath',
 'saytheirnam',
 'sayhernam',
 'sayhisnam',
 'equal',
 'equiti',
 'discrimin',
 'shoot',
 'kill',
 'murder',
 'assassin',
 'mental-health',
 'racial-profil',
 'stereotyp',
 'racist',
 'arrest',
 'warrant',
 'charg',
 'unfair',
 'trump',
 'biden',
 'america',
 'amerikkka',
 'united-st',
 'agenda',
 'privileg',
 'immun',
 'malfeas',
 'irrespons',
 'integr',
 'infring',
 'civil',
 'right',
 'violat',
 'impun',
 'illeg',
 'expos',


In [30]:
def tweet_tokenizer(list_in):
    # list_in holds one player's tweets; returns one list of stemmed tokens per tweet
    filtered_tweets = []
    stemmer = SnowballStemmer('english')
    # skip English stop words plus the Twitter markers '@', '#', and 'rt'
    skip_tokens = set(stopwords.words('english')) | {'@', 'rt', '#'}
    for tweet in list_in:
        try:
            tokenized_tweet = word_tokenize(str(tweet))
            stemmed_tokens = [stemmer.stem(token) for token in tokenized_tweet]
            filtered_tweets.append([token for token in stemmed_tokens if token not in skip_tokens])
        except Exception:
            print(str(tweet) + " did not work")
    print('this player is done')
    return filtered_tweets

In [31]:
def political_bools(list_in):
    # list_in holds one player's tokenized tweets; returns one boolean per tweet
    # political_stems is the stemmed keyword list built in the cells above
    stems = set(political_stems)
    # a tweet is flagged as political if any of its tokens matches a political stem
    # (an extra VADER check requiring a 'neg' score above .4 is currently disabled)
    return [any(token in stems for token in tweet) for tweet in list_in]

In [32]:
def count_political(list_in):
    counter = 0
    for tweet in list_in:
        if tweet:
            counter += 1
    return counter

In [33]:
twitter_df['filtered_tweets'] = twitter_df['full_text'].apply(tweet_tokenizer)

this player is done


In [34]:
twitter_df['political_bools'] = twitter_df['filtered_tweets'].apply(political_bools)

In [35]:
twitter_df['num_political'] = twitter_df['political_bools'].apply(count_political)

In [36]:
twitter_df[twitter_df['name'] == 'LeBron James']

Unnamed: 0,name,screen_name,followers_count,statuses_count,account_created_at,full_text,created_at,favorite_count,retweet_count,retweeted,id,possibly_sensitive,filtered_tweets,political_bools,num_political
330,LeBron James,KingJames,49729987.0,7710.0,2009-03-06 16:25:53+00:00,[Every morning you have a choice… Snooze 😴 or ...,"[2021-05-13 17:08:17+00:00, 2021-05-12 20:52:4...","[10116, 89447, 12602, 2010, 83684, 4553, 4580,...","[1030, 7349, 832, 50, 7314, 268, 279, 16381, 1...","[False, False, False, False, False, False, Fal...","[1392889448178753538, 1392583560327008260, 139...","[False, nan, False, nan, nan, False, False, Fa...","[[everi, morn, choice…, snooz, 😴, rise, 💪🏾, .,...","[False, False, True, False, False, False, Fals...",263


In [37]:
political_tweets = []
political_tweets_tokens = []
for i in range(0, len(twitter_df['full_text'].iloc[330])):
    if twitter_df['political_bools'].iloc[330][i]:
        political_tweets.append(twitter_df['full_text'].iloc[330][i])
        political_tweets_tokens.append(twitter_df['filtered_tweets'].iloc[330][i])
political_tweets

['Sue, Aja and DT were showing OUT at media day and I LOVE IT!! 🔥🔥🔥🔥🔥🔥🔥🔥🔥. https://t.co/TCmW68ofmo',
 '@CuffsTheLegend Literally listening to this album right now bro!! My GOODNESS!! 🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥 #Timeless',
 'Protect our Young Black Women &amp; Men! 🙏🏾✊🏾🤎👑',
 'I fueled the wrong conversation about Ma’Khia Bryant and I owe it to her and this movement to change it. Thank you @fabiolacineas for educating us about Ma’Khia and her story and why this needs to be about her. https://t.co/Owh1vDJWXi\xa0\xa0\xa0#sayhername #Blacklivesmatter',
 'Man I miss @kendricklamar! In dying need of his gifts/presence right now 🎶 🖊 📄.',
 'RT @morethanavote: If you think the activism around George Floyd’s murder was just about getting a verdict, you missed the point.\n\nThis is…',
 'I’m so damn tired of seeing Black people killed by police.\xa0I took the tweet down because its being used to create more hate -This isn’t about one officer.\xa0\xa0it’s about the entire system and they always use our words to c

In [38]:
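analyser = SentimentIntensityAnalyzer()
rows = []
for tweet in twitter_df['filtered_tweets'].iloc[330]:
    joined_sen = ' '.join(tweet)
    ss = analyser.polarity_scores(joined_sen)
    ss['sentence'] = tweet
    rows.append(ss)
# DataFrame.append was removed in pandas 2.0, so build the frame from the list of dicts in one step
output = pd.DataFrame(rows, columns=['compound', 'neg', 'neu', 'pos', 'sentence'])
output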

Unnamed: 0,compound,neg,neu,pos,sentence
0,0.0000,0.000,1.000,0.000,"[everi, morn, choice…, snooz, 😴, rise, 💪🏾, ., ..."
1,0.6988,0.000,0.692,0.308,"[get, close, ’, lineup, tonight, import, put, ..."
2,0.6988,0.000,0.653,0.347,"[sue, ,, aja, dt, show, media, day, love, !, !..."
3,0.6884,0.000,0.514,0.486,"[djmeel, one, top, 3, fav, show, !, might, eve..."
4,0.0000,0.000,1.000,0.000,"[tht, !, !, !, !, !, !, !, !, !, !, !, !, !, !..."
...,...,...,...,...,...
2995,-0.7108,0.397,0.603,0.000,"[rickybobbyvip, u, must, n't, saw, ., love, fa..."
2996,-0.2244,0.147,0.853,0.000,"[know, say, miss, y'all, ., let, 's, go, anoth..."
2997,0.7712,0.000,0.428,0.572,"[ok, 's, 30, min, !, love, talk, y'all]"
2998,0.3612,0.000,0.706,0.294,"[stnorton24, crib, like, 10, min, away, aaa]"


In [39]:
output[output['neg'] > .5]

Unnamed: 0,compound,neg,neu,pos,sentence
31,-0.6996,0.537,0.463,0.0,"[man, miss, kendricklamar, !, die, need, gifts..."
78,-0.8672,0.522,0.478,0.0,"[would, rather, war, peac, world, peac, war, w..."
165,-0.7896,0.7,0.3,0.0,"[courtsid, karen, mad, mad, !, !, 🤣🤣🤣🤣🤣🤣🤣]"
340,-0.7405,0.548,0.452,0.0,"[swear, ’, man, !, !, !, problem, !, 🤣🤣🤣🤣🤣🤣🤣, ..."
370,-0.6421,0.68,0.32,0.0,"[cuffsthelegend, never, !, !, !, greatest]"
460,-0.3348,0.543,0.457,0.0,"[hunt, tough, !, !, !, 😤😤]"
518,-0.858,0.826,0.174,0.0,"[fuck, man, !, !, !, !, demand, chang, ., sick]"
787,-0.5255,0.517,0.263,0.219,"[tiktok/kid, +, =, look, like, damn, fool, !, ..."
793,-0.7115,0.581,0.419,0.0,"[bjwhite, rebel, without, caus, !, lol, ., ’, ..."
811,-0.6671,0.578,0.422,0.0,"[damn, ,, realli, miss, sport, !, !, !, !, !, ..."


In [40]:
twitter_df

Unnamed: 0,name,screen_name,followers_count,statuses_count,account_created_at,full_text,created_at,favorite_count,retweet_count,retweeted,id,possibly_sensitive,filtered_tweets,political_bools,num_political
0,Álex Abrines,alexabrines,54283.0,3118.0,2011-06-30 20:21:21+00:00,"[@BertranX @ganchetto @KaskyCascante Bueno, fa...","[2021-05-16 08:49:29+00:00, 2021-05-16 08:28:3...","[2, 3, 180, 186, 1, 3, 28, 41, 0, 52, 156, 263...","[0, 0, 1, 16, 0, 0, 1, 0, 1014, 1, 6, 26, 7, 7...","[False, False, False, False, False, False, Fal...","[1393851084444217344, 1393845803412819969, 139...","[nan, False, False, nan, nan, nan, False, Fals...","[[bertranx, ganchetto, kaskycascant, bueno, ,,...","[False, False, False, False, False, False, Fal...",32
1,Quincy Acy,QuincyAcy,33445.0,16231.0,2009-04-10 16:04:23+00:00,[RT @Ballislife: The NBA needs more players th...,"[2021-05-15 03:40:21+00:00, 2021-05-14 21:47:3...","[0, 0, 2, 1, 22, 1, 0, 259, 0, 0, 0, 0, 5, 9, ...","[541, 100, 0, 0, 0, 0, 3, 13, 593, 22, 7875, 1...","[False, False, False, False, False, False, Fal...","[1393410898027433986, 1393322139722334208, 139...","[False, nan, False, False, False, nan, False, ...","[[ballislif, :, nba, need, player, think, like...","[False, False, False, False, False, False, Fal...",194
2,Jaylen Adams,10jadams,4870.0,4721.0,2012-12-23 16:42:50+00:00,[RT @ComplexMusic: THE REAL IS BACK.\n\nJ. COL...,"[2021-05-14 05:48:18+00:00, 2021-05-09 21:17:0...","[0, 0, 0, 0, 0, 2, 0, 0, 0, 5, 0, 0, 0, 28, 0,...","[3794, 3, 32212, 6457, 7516, 0, 3000, 6630, 20...","[False, False, False, False, False, False, Fal...","[1393080711364681729, 1391502508426764292, 138...","[False, False, nan, False, False, nan, nan, na...","[[complexmus, :, real, back, ., j., cole, ., t...","[False, False, False, False, False, False, Fal...",121
3,Steven Adams,RealStevenAdams,216669.0,1642.0,2013-03-09 03:15:09+00:00,[RT @StationCDRKelly: Every day is #EarthDay e...,"[2021-04-22 20:42:30+00:00, 2021-02-18 20:00:0...","[0, 0, 0, 345, 54657, 0, 758, 0, 1648, 392, 58...","[812, 3321, 34, 10, 4306, 13692, 25, 10, 113, ...","[False, False, False, False, False, False, Fal...","[1385333211237031938, 1362492093051068416, 135...","[False, nan, nan, False, False, False, False, ...","[[stationcdrkelli, :, everi, day, earthday, ev...","[False, False, False, False, False, False, Fal...",64
4,Bam Adebayo,Bam1of1,180603.0,14228.0,2012-04-17 01:29:18+00:00,[Just Sad Bro.... @DwyaneWade @spidadmitchell ...,"[2021-05-12 03:19:45+00:00, 2021-05-05 22:47:4...","[3573, 1000, 3060, 0, 17586, 0, 3756, 7299, 0,...","[208, 32, 392, 847, 935, 690, 546, 484, 41, 78...","[False, False, False, False, False, False, Fal...","[1392318550451363842, 1390075752629030912, 138...","[False, False, False, False, nan, False, False...","[[sad, bro, ...., dwyanewad, spidadmitchel, ht...","[False, False, False, False, False, False, Fal...",92
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
699,Cody Zeller,CodyZeller,144182.0,1057.0,2011-03-01 00:52:17+00:00,[8 years ago I took this picture with Preston ...,"[2021-05-15 00:45:58+00:00, 2021-04-20 14:31:0...","[257, 0, 0, 0, 16, 184, 27, 0, 0, 180, 561, 0,...","[6, 1, 10, 20, 1, 10, 0, 10, 5, 5, 38, 141, 12...","[False, False, False, False, False, False, Fal...","[1393367013234466816, 1384514945795170307, 138...","[False, nan, nan, nan, False, False, False, na...","[[8, year, ago, took, pictur, preston, howel, ...","[False, False, False, False, False, False, Fal...",99
700,Tyler Zeller,ZellerTyler,45338.0,121.0,2012-04-09 22:05:06+00:00,[RT @CodyZeller: Wearing #40 has always been s...,"[2016-06-02 16:06:53+00:00, 2016-01-30 17:30:3...","[0, 33, 0, 0, 87, 24, 41, 54, 34, 0, 181, 41, ...","[98, 4, 11, 19, 20, 20, 18, 19, 9, 7, 72, 15, ...","[False, False, False, False, False, False, Fal...","[738401511034609664, 693486467939700736, 61527...","[nan, False, nan, False, False, nan, nan, nan,...","[[codyzel, :, wear, 40, alway, special, becaus...","[False, False, False, False, False, False, Fal...",10
701,Paul Zipser,paulzzip,1463.0,261.0,2017-10-19 13:34:00+00:00,"[Uffff 😂😂😂😂😂 https://t.co/YpMBkSInp7, @RolandL...","[2021-05-08 20:50:49+00:00, 2021-04-29 09:43:2...","[6, 5, 0, 1, 8, 0, 27, 0, 8, 1, 0, 2, 2, 41, 2...","[0, 0, 25, 0, 0, 11, 1, 4, 0, 0, 4250, 0, 0, 0...","[False, False, False, False, False, False, Fal...","[1391133509608251394, 1387704053854089216, 138...","[False, nan, nan, nan, False, nan, False, nan,...","[[uffff, 😂😂😂😂😂, https, :, //t.co/ypmbksinp7], ...","[False, False, False, False, False, False, Fal...",9
702,Ante Žižić,Ante__Zizic,140.0,7.0,2017-12-10 01:39:23+00:00,[@I_Am_Thunder32 Checks still hit account the ...,"[2019-03-09 04:25:41+00:00, 2019-03-09 02:11:2...","[1, 0, 0, 0, 3, 1, 1]","[0, 0, 0, 0, 0, 0, 1]","[False, False, False, False, False, False, False]","[1104236790381199361, 1104203005568663552, 110...","[nan, nan, nan, nan, nan, nan, nan]","[[i_am_thunder32, check, still, hit, account],...","[True, False, False, False, False, False, False]",1


In [41]:
twitter_df_new1 = twitter_df.iloc[:200, :]
twitter_df_new2 = twitter_df.iloc[200:380, :]
twitter_df_new3 = twitter_df.iloc[380:550, :]
twitter_df_new4 = twitter_df.iloc[550:, :]

In [42]:
twitter_df_new1.to_pickle('data/twitter_political_1.pkl')
twitter_df_new2.to_pickle('data/twitter_political_2.pkl')
twitter_df_new3.to_pickle('data/twitter_political_3.pkl')
twitter_df_new4.to_pickle('data/twitter_political_4.pkl')

In [None]:
twitter_df_test1 = pd.read_pickle('data/twitter_political_1.pkl')
twitter_df_test2 = pd.read_pickle('data/twitter_political_2.pkl')
twitter_df_test3 = pd.read_pickle('data/twitter_political_3.pkl')
twitter_df_test4 = pd.read_pickle('data/twitter_political_4.pkl')

In [None]:
twitter_df_complete = pd.concat([twitter_df_test1, twitter_df_test2, twitter_df_test3, twitter_df_test4])
twitter_df_complete
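
A quick way to confirm that the pickle chunks reassemble into the original frame is to compare the concatenation against the source. The sketch below is only an illustration under stated assumptions: `toy_df`, the temporary directory, and the slice boundaries are hypothetical stand-ins, not the project's data or paths.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for twitter_df (hypothetical data, with a list-valued column).
toy_df = pd.DataFrame({
    "name": [f"player_{i}" for i in range(10)],
    "favorite_count": [[i, i + 1] for i in range(10)],
})

# Boundaries chosen for the toy frame; the notebook slices at 200/380/550.
bounds = [0, 3, 6, 8, len(toy_df)]
chunks = [toy_df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, chunk in enumerate(chunks, start=1):
        path = Path(tmp) / f"twitter_political_{i}.pkl"
        chunk.to_pickle(path)
        paths.append(path)
    # Read the chunks back in order and stitch them together.
    restored = pd.concat([pd.read_pickle(p) for p in paths])

# equals() compares shape, index, and values, so a dropped,
# duplicated, or reordered row would make this False.
round_trip_ok = restored.equals(toy_df)
```

On the real data, the analogous check would be `twitter_df_complete.equals(twitter_df)`.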

In [None]:
# Length of each list-valued column for the first player; all should match
twitter_df[['full_text', 'created_at', 'favorite_count', 'retweet_count', 'retweeted', 'id', 'possibly_sensitive']].iloc[0].apply(len)
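
The length check above matters because the list-valued columns are parallel: position `k` in each list describes the same tweet. If the lengths agree, one possible next step is to expand the frame to one row per tweet and apply the January 1, 2017 cutoff from the preprocessing notes. This is a minimal sketch on a hypothetical two-player frame (`toy_df` and its values are made up), and it assumes pandas 1.3 or later, where `DataFrame.explode` accepts a list of columns.

```python
import pandas as pd

# Hypothetical frame shaped like twitter_df: one row per player,
# parallel lists holding one entry per tweet.
toy_df = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "full_text": [["t1", "t2", "t3"], ["t4"]],
    "created_at": [
        ["2021-05-16 08:49:29+00:00",
         "2016-06-02 16:06:53+00:00",
         "2019-03-09 04:25:41+00:00"],
        ["2021-05-15 03:40:21+00:00"],
    ],
    "favorite_count": [[2, 3, 180], [0]],
})

list_cols = ["full_text", "created_at", "favorite_count"]

# Multi-column explode requires equal list lengths within each row.
tweets = toy_df.explode(list_cols, ignore_index=True)
tweets["created_at"] = pd.to_datetime(tweets["created_at"], utc=True)
tweets["favorite_count"] = tweets["favorite_count"].astype(int)

# Keep tweets from 2017-01-01 onward.
cutoff = pd.Timestamp("2017-01-01", tz="UTC")
recent = tweets[tweets["created_at"] >= cutoff].reset_index(drop=True)
```

In the toy frame this drops the single 2016 tweet and keeps the other three.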

In [None]:
twitter_df.dtypes

## Twitter data preprocessing (David, Quinton)

Filter the tweets to January 1, 2017 through the present.
- First approach
(1) Lowercase each tweet and check whether each phrase appears in the political word dictionary. Sum the weights of the matches, where each dictionary entry is assigned a weight from 0 to 1.

- Second approach
(1) Stem each tweet and calculate the proportion of political words relative to the length of the text. <br>

(2) Add columns for the positive, negative, neutral, and compound sentiment scores. <br>
(3) Build a dictionary of political words. <br>

Marking rule:

    if the tweet has a political word
        if its negative sentiment score > sentiment_threshold
            mark the tweet as political
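
The marking rule above can be sketched in plain Python. Everything named here is an assumption for illustration: the keyword weights in `POLITICAL_WEIGHTS`, the `NEG_THRESHOLD` value, and the reading of the rule as "negative sentiment score above a threshold" are placeholders, since the actual dictionary and threshold are not fixed yet.

```python
# Hypothetical stemmed keywords with weights in [0, 1].
POLITICAL_WEIGHTS = {
    "vote": 1.0,
    "elect": 0.9,
    "polic": 0.7,
    "justic": 0.6,
    "chang": 0.3,
}

NEG_THRESHOLD = 0.15  # assumed cutoff on the negative sentiment score


def political_weight(tokens):
    """First approach: sum the dictionary weights of matching tokens."""
    return sum(POLITICAL_WEIGHTS.get(tok.lower(), 0.0) for tok in tokens)


def political_proportion(tokens):
    """Second approach: share of tokens found in the dictionary."""
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok.lower() in POLITICAL_WEIGHTS)
    return hits / len(tokens)


def is_political(tokens, neg_score, neg_threshold=NEG_THRESHOLD):
    """Mark a tweet political if it has a political word AND is negative enough."""
    has_political_word = political_weight(tokens) > 0
    return has_political_word and neg_score > neg_threshold


# Made-up example inputs: stemmed tokens plus a negative sentiment score.
flag_a = is_political(["demand", "chang", "vote"], neg_score=0.30)
flag_b = is_political(["hunt", "tough"], neg_score=0.457)
flag_c = is_political(["vote"], neg_score=0.0)
```

Here `flag_a` is `True`, while `flag_b` (no political word) and `flag_c` (no negative sentiment) are `False`. Requiring both conditions keeps strongly negative but non-political tweets from being flagged.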
      

In [None]:
twitter_df['favorite_count'].shape

## Data Visualization for the player data (Alex, Lauren, Michael)
e.g., salary, player stats, and individual player highlights (e.g., LeBron James, Damian Lillard)

### `pd.plotting.scatter_matrix()` of everything in the Twitter dataset for LeBron James (Alex)

### `pd.plotting.scatter_matrix()` of everything in the main dataset (Alex)

### Scatter plot of points scored by each player versus their total number of political tweets for 2020 (Lauren)

### Scatter plot of each player's Twitter followers (x-axis) versus salary (y-axis), relating a player's influence to their salary (Lauren)

### Bar plot of the 10 players with the most political tweets, with players on the x-axis (Lauren)
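
One way to pick the players for this plot is `DataFrame.nlargest` on `num_political`. The sketch below is an assumption about how the selection might be done, not the final plotting code. It reuses the nine `name`/`num_political` pairs visible in the `twitter_df` preview and takes the top 3 so the result is easy to check by eye; on the full frame `n` would be 10.

```python
import pandas as pd

# name / num_political pairs copied from the twitter_df preview above.
preview = pd.DataFrame({
    "name": ["Álex Abrines", "Quincy Acy", "Jaylen Adams", "Steven Adams",
             "Bam Adebayo", "Cody Zeller", "Tyler Zeller", "Paul Zipser",
             "Ante Žižić"],
    "num_political": [32, 194, 121, 64, 92, 99, 10, 9, 1],
})

# Rows with the largest num_political, already sorted in descending order.
top_players = preview.nlargest(3, "num_political")[["name", "num_political"]]
```

The resulting frame can be passed to a bar plot with `name` on the x-axis and `num_political` as the bar height.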

### Grouped bar plot, by player, of political tweets over 2017-2021 for the top 3 earning players as of 2020 (Michael)

### Side-by-side box plot: average favorite count of non-political versus political tweets for all players over 2017-2021 (Michael)
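
Before plotting, the per-player averages need to be computed from the parallel `favorite_count` and `political_bools` lists. The following is a minimal sketch of that aggregation on a hypothetical two-player frame. The values are made up, and it assumes both columns hold real Python lists of equal length per player and that pandas 1.3 or later is available for multi-column `explode`.

```python
import pandas as pd

# Hypothetical frame with the same column layout as twitter_df.
toy_df = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "favorite_count": [[10, 20, 30], [5, 15]],
    "political_bools": [[True, False, False], [False, True]],
})

# One row per tweet, keeping favorites aligned with the political flag.
tweets = toy_df.explode(["favorite_count", "political_bools"], ignore_index=True)
tweets["favorite_count"] = tweets["favorite_count"].astype(float)
tweets["political_bools"] = tweets["political_bools"].astype(bool)

# Mean favorites per player, split into political / non-political columns.
avg_favorites = (
    tweets.groupby(["name", "political_bools"])["favorite_count"]
    .mean()
    .unstack("political_bools")
    .rename(columns={False: "non_political", True: "political"})
)
```

Each row of `avg_favorites` is one player, so drawing one box per column (for example with `avg_favorites.boxplot()`, which needs matplotlib) gives the two distributions side by side.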

### Scatter plot of player scoring versus number of political tweets for 2020 (Michael)