# Collecting and Cleaning Twitter Usernames for Athletes

This notebook collects and prepares Twitter usernames for athletes across several sports. It queries the Wikidata SPARQL endpoint for soccer players, merges the results with username lists for baseball, basketball, football, and hockey, and cleans the combined data into a single dataset of athlete usernames.

The final `athletes` dataset has three columns:
- `name`: Full name of the athlete
- `username`: Athlete's Twitter username
- `sport`: Athlete's sport

In [1]:
%pip install unidecode SPARQLWrapper


In [3]:
import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON

## Collect Soccer Player Usernames

In [71]:
# set up the SPARQL endpoint and write the query
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
query = """
SELECT ?item ?itemLabel ?twitter
WHERE 
{
  ?item wdt:P106 wd:Q937857. # occupation (P106): association football player
  OPTIONAL{?item wdt:P2002 ?twitter} # Twitter username (P2002), if recorded
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
LIMIT 1000
"""
sparql.setQuery(query)
sparql.setReturnFormat(JSON)

In [72]:
# execute query and parse results
try:
    results = sparql.query().convert()
except Exception as e:
    print(f"Error executing query: {e}")
    results = None

### Process Query Results

In [73]:
# process results if query was successful
if results:
    data = []
    # get relevant fields
    for result in results["results"]["bindings"]:
        item = result["item"]["value"]
        item_label = result.get("itemLabel", {}).get("value", None)
        twitter = result.get("twitter", {}).get("value", None)
        data.append({"item": item, "item_label": item_label, "twitter": twitter})

    # convert to dataframe and save to csv
    # (the 'item' column is kept here and dropped later, during cleaning)
    soccer_players = pd.DataFrame(data)
    soccer_players.to_csv("soccer_players_twitter.csv", index=False)

    # filter out rows with missing usernames
    soccer_players = soccer_players[soccer_players['twitter'].notna()]
    print(soccer_players.head())
else:
    print("No results.")

                                    item            item_label          twitter
2    http://www.wikidata.org/entity/Q615          Lionel Messi   fundacionmessi
3    http://www.wikidata.org/entity/Q624  Alessandro Del Piero      delpieroale
9   http://www.wikidata.org/entity/Q1885        Jürgen Locadia  locadiaofficial
10  http://www.wikidata.org/entity/Q1894         Memphis Depay          Memphis
11  http://www.wikidata.org/entity/Q1894         Memphis Depay          memphis
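
The parsing step above can be isolated into a small function, which makes it easy to check against a hand-written response without calling the endpoint. The sketch below is illustrative only: `flatten_bindings` is a hypothetical helper name, and the second sample row (`Q0`, "Example Player") is made up to show an unbound `OPTIONAL` variable.

```python
from typing import Optional


def flatten_bindings(results: Optional[dict]) -> list:
    """Flatten a SPARQL JSON response into one dict per result row."""
    if not results:
        return []
    rows = []
    for binding in results["results"]["bindings"]:
        rows.append({
            "item": binding["item"]["value"],
            # OPTIONAL variables are absent from the binding when unbound
            "item_label": binding.get("itemLabel", {}).get("value"),
            "twitter": binding.get("twitter", {}).get("value"),
        })
    return rows


# Hand-written sample in the shape of a SPARQL JSON response (not real query output)
sample = {
    "results": {
        "bindings": [
            {
                "item": {"value": "http://www.wikidata.org/entity/Q615"},
                "itemLabel": {"value": "Lionel Messi"},
                "twitter": {"value": "fundacionmessi"},
            },
            {
                # hypothetical entity with no Twitter username recorded
                "item": {"value": "http://www.wikidata.org/entity/Q0"},
                "itemLabel": {"value": "Example Player"},
            },
        ]
    }
}

rows = flatten_bindings(sample)
print(rows[1]["twitter"])
```

Returning an empty list for a failed query keeps the downstream `pd.DataFrame(...)` call safe without a separate `if results:` branch.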


## Read and Merge Data

In [7]:
# load data (a multi-character separator such as ' - ' needs the Python parser engine)
baseball_players = pd.read_csv('../data/usernames/baseball_accounts.txt', sep=' - ', engine='python')
basketball_players = pd.read_csv('../data/usernames/basketball_accounts.txt')
football_players = pd.read_csv('../data/usernames/football_accounts.txt', sep=' - ', engine='python')
hockey_players = pd.read_csv('../data/usernames/hockey_accounts.txt', sep=' - ', engine='python')
soccer_players = pd.read_csv('../data/usernames/soccer_accounts.csv')
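
As a quick illustration of how the `' - '` separator behaves, the sketch below parses an in-memory sample instead of the real files. The file layout (a `name - username` header followed by one athlete per line) is an assumption inferred from the column names printed in the next cell; the two rows reuse names that appear later in this notebook.

```python
import io

import pandas as pd

# Assumed layout of the .txt files: "name - username", one athlete per line
sample = io.StringIO(
    "name - username\n"
    "David Aardsma - @TheDA53\n"
    "Henry Aaron - @HenryLouisAaron\n"
)

# Separators longer than one character are treated as regular expressions,
# which only the Python engine supports
demo = pd.read_csv(sample, sep=" - ", engine="python")
print(demo.columns.tolist())
```

Because the separator includes the surrounding spaces, a hyphen inside a name with no spaces around it would not be treated as a column break.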

In [8]:
# view column names
print(f'baseball: {baseball_players.columns}\
        \nbasketball:{basketball_players.columns}\
        \nfootball: {football_players.columns}\
        \nhockey: {hockey_players.columns}\
        \nsoccer: {soccer_players.columns}')

baseball: Index(['name', 'username'], dtype='object')        
basketball:Index(['Rk', 'Player', 'Twitter'], dtype='object')        
football: Index(['name', 'username'], dtype='object')        
hockey: Index(['name', 'username'], dtype='object')        
soccer: Index(['item', 'item_label', 'twitter'], dtype='object')


## Data Cleaning

In [9]:
# drop unnecessary columns
basketball_players = basketball_players.drop(columns=['Rk'])
soccer_players = soccer_players.drop(columns=['item'])

# rename columns for consistency
basketball_players = basketball_players.rename(columns={'Player': 'name', 'Twitter': 'username'})
soccer_players = soccer_players.rename(columns={'item_label': 'name', 'twitter': 'username'})

In [10]:
# print updated columns
print(f'basketball:{basketball_players.columns}')
print(f'soccer: {soccer_players.columns}')

basketball:Index(['name', 'username'], dtype='object')
soccer: Index(['name', 'username'], dtype='object')
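
The drop-and-rename steps above could also be expressed as one mapping applied to every source. This is only a sketch: `COLUMN_MAP` and `to_common_schema` are hypothetical names, and the basketball row is invented for the demonstration.

```python
import pandas as pd

# Source-specific column names mapped onto the shared schema (hypothetical helper)
COLUMN_MAP = {
    "Player": "name",
    "item_label": "name",
    "Twitter": "username",
    "twitter": "username",
}


def to_common_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Rename known columns and keep only 'name' and 'username'."""
    renamed = df.rename(columns=COLUMN_MAP)
    return renamed[["name", "username"]]


# Small stand-ins for the real frames; the basketball row is made up
basketball_demo = pd.DataFrame(
    {"Rk": [1], "Player": ["Example Player"], "Twitter": ["@example_handle"]}
)
soccer_demo = pd.DataFrame(
    {
        "item": ["http://www.wikidata.org/entity/Q615"],
        "item_label": ["Lionel Messi"],
        "twitter": ["fundacionmessi"],
    }
)

print(to_common_schema(basketball_demo).columns.tolist())
print(to_common_schema(soccer_demo).columns.tolist())
```

Selecting the two target columns at the end drops extras such as `Rk` and `item` without naming them individually.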


### Merge Data

In [78]:
# add indicators for each sport
baseball_players['sport'] = 'baseball'
basketball_players['sport'] = 'basketball'
football_players['sport'] = 'football'
hockey_players['sport'] = 'hockey'
soccer_players['sport'] = 'soccer'

In [11]:
# concatenate all data
athletes = pd.concat([baseball_players, basketball_players, football_players, hockey_players, soccer_players], ignore_index=True)
athletes

                   name          username
0         David Aardsma          @TheDA53
1           Henry Aaron  @HenryLouisAaron
2         Andrew Abbott   @andrewabbott33
3           Cory Abbott        @Cabbott40
4            Jim Abbott      @jabbottum31
...                 ...               ...
8530          Jan Sobol               NaN
8531    Antonio Salazar               NaN
8532  Ricardo van Rhijn               NaN
8533       Ian Holloway               NaN
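
One way to keep the sport label and the concatenation together is to tag each frame in the same expression that combines them, so the `sport` column cannot be missed when cells run out of order. The sketch below uses two single-row stand-in frames; the `frames` dictionary is an illustrative structure, not part of the original notebook.

```python
import pandas as pd

# Single-row stand-ins for the per-sport frames, keyed by sport label
frames = {
    "baseball": pd.DataFrame({"name": ["David Aardsma"], "username": ["@TheDA53"]}),
    "soccer": pd.DataFrame({"name": ["Jan Sobol"], "username": [None]}),
}

# assign() returns a copy with the extra column, leaving the inputs unchanged
combined = pd.concat(
    [df.assign(sport=sport) for sport, df in frames.items()],
    ignore_index=True,
)
print(combined.columns.tolist())
```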


### Drop Missing Values

In [80]:
# identify rows with missing usernames
athletes[athletes['username'].isna()]

                   name  username       sport
3085    Bubbles Hawkins       NaN  basketball
3086          Mo Howard       NaN  basketball
3090   Dermie O'Connell       NaN  basketball
3092        Tal Skinner       NaN  basketball
3094       Bobby Watson       NaN  basketball
...                 ...       ...         ...
8530          Jan Sobol       NaN      soccer
8531    Antonio Salazar       NaN      soccer
8532  Ricardo van Rhijn       NaN      soccer
8533       Ian Holloway       NaN      soccer


In [81]:
# drop rows with no username (copy so later column edits act on an independent frame)
athletes = athletes.dropna(subset=['username']).copy()

# verify no null values
athletes.isna().sum()

name        0
username    0
sport       0
dtype: int64

### Ensure Username Consistency

In [82]:
# remove @ from all usernames (literal match, not a regular expression)
athletes['username'] = athletes['username'].str.replace('@', '', regex=False)

athletes

                 name         username     sport
0       David Aardsma          TheDA53  baseball
1         Henry Aaron  HenryLouisAaron  baseball
2       Andrew Abbott   andrewabbott33  baseball
3         Cory Abbott        Cabbott40  baseball
4          Jim Abbott      jabbottum31  baseball
...               ...              ...       ...
8501  Wesley Sneijder   sneijder101010    soccer
8518   Carlos Cuéllar        Cuellar24    soccer
8520   Simon Mignolet        SMignolet    soccer
8525  Winston Bogarde  WinstonBogarde5    soccer
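
Removing `@` is one part of consistency. The soccer query output earlier lists `Memphis` and `memphis` for the same player, which suggests that case-insensitive duplicates may also be worth handling. The sketch below is a possible extension, not something the notebook does: `normalize_handle` and `dedupe_handles` are hypothetical helpers, and the validity pattern assumes handles are 1 to 15 letters, digits, or underscores.

```python
import re

# Assumption: a handle is 1-15 characters of letters, digits, or underscores
HANDLE_RE = re.compile(r"[A-Za-z0-9_]{1,15}")


def normalize_handle(raw):
    """Strip whitespace and a leading @; return None if the result looks invalid."""
    handle = raw.strip().lstrip("@")
    return handle if HANDLE_RE.fullmatch(handle) else None


def dedupe_handles(handles):
    """Keep the first spelling of each handle, comparing case-insensitively."""
    seen = set()
    unique = []
    for handle in handles:
        key = handle.lower()
        if key not in seen:
            seen.add(key)
            unique.append(handle)
    return unique


cleaned = [normalize_handle(h) for h in ["@TheDA53", " Memphis ", "memphis", "not a handle"]]
valid = [h for h in cleaned if h is not None]
print(dedupe_handles(valid))  # ['TheDA53', 'Memphis']
```

With pandas, a comparable result could be obtained by deduplicating on a lowercased copy of the `username` column.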


## Final Cleaned DataFrame

In [83]:
athletes.to_csv('../data/usernames/accounts_final.csv', index=False)