# Tabular cleaning

### Cleaning NBA combine data

I cleaned this data with the intent of merging it with the NBA player season data. The first step was to adjust the format of the player name column: I changed it from (Last, First) to (First Last) using strsplit and sapply in R. I also renamed the column from "PLAYER" to "Name" so that I could merge the datasets by "Name". I then renamed some other columns to make them clearer and dropped those that were not needed. This dataset was relatively clean to begin with, so only a few minor adjustments were needed.

[Link to code](https://github.com/anly501/dsan-5000-project-thm12/blob/main/codes/01-data-gathering/data_gathering%26cleaning.Rmd)

[Link to data](https://github.com/anly501/dsan-5000-project-thm12/blob/main/data/01-modified-data/cleaned_nba_combine.csv)
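
The name reformatting itself was done in R, and the linked R Markdown file holds the actual code. As a rough pandas equivalent, the sketch below flips "Last, First" into "First Last" and renames the column. The sample names and the helper `flip_name` are hypothetical and only illustrate the step.

```python
import pandas as pd

# Hypothetical sample rows; the real combine file has many more columns.
combine = pd.DataFrame({"PLAYER": ["Doe, John", "Smith, Alex"]})

def flip_name(name: str) -> str:
    """Turn 'Last, First' into 'First Last'; leave other formats unchanged."""
    parts = [p.strip() for p in name.split(",", 1)]
    return f"{parts[1]} {parts[0]}" if len(parts) == 2 else name

# Reformat the names, then rename the column so it can serve as the merge key.
combine["PLAYER"] = combine["PLAYER"].map(flip_name)
combine = combine.rename(columns={"PLAYER": "Name"})
print(combine["Name"].tolist())
```

Names without a comma pass through unchanged, so a second run over already cleaned data does no harm.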


### Cleaning NBA player season data

I cleaned and subset the NBA player season data so that it could be merged with the combine data. The dataset contained multiple rows for each player across seasons. To merge with the NBA combine dataset, I kept a single row per player: the season in which they scored the most points, which I treated as their best season. Points are generally the most important stat in basketball, and while assists and rebounds also matter, total points should be a good indicator of a season in which a player was at or near their best and stayed healthy enough to play most of the season. I then deleted a plethora of columns that would not be useful as indicators of in-game performance.

[Link to code](https://github.com/anly501/dsan-5000-project-thm12/blob/main/codes/01-data-gathering/data_gathering%26cleaning.Rmd)

[Link to data](https://github.com/anly501/dsan-5000-project-thm12/blob/main/data/01-modified-data/cleaned_best_NBA_season_player.csv)
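
The "best season" selection was done in R. A minimal pandas sketch of the same idea is shown here, assuming hypothetical column names "Name", "Season", and "PTS" and made-up rows. It keeps, for each player, the row with the highest season point total.

```python
import pandas as pd

# Hypothetical rows; "Name", "Season", and "PTS" are assumed column names.
seasons = pd.DataFrame({
    "Name":   ["John Doe", "John Doe", "Alex Smith", "Alex Smith"],
    "Season": [2018, 2019, 2018, 2019],
    "PTS":    [900, 1400, 1100, 700],
})

# idxmax returns the row label of each player's highest-scoring season.
best_rows = seasons.groupby("Name")["PTS"].idxmax()

# Select those rows, giving exactly one row per player.
best_season = seasons.loc[best_rows].reset_index(drop=True)
print(best_season)
```

If a player has two seasons with the same point total, `idxmax` keeps the first one it encounters, which is worth keeping in mind when the row order matters.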

### Merging NBA combine and player season data

I then merged the subset player season dataset with the NBA combine dataset using a full join on player name in R. Finally, I removed players without combine data by dropping rows that had no value for "combine_year".

[Link to code](https://github.com/anly501/dsan-5000-project-thm12/blob/main/codes/01-data-gathering/data_gathering%26cleaning.Rmd)

[Link to data](https://github.com/anly501/dsan-5000-project-thm12/blob/main/data/01-modified-data/cleaned_NBA_combined.csv)
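
The merge was performed in R. For illustration, the sketch below does a comparable full join in pandas and then drops rows with no "combine_year". Only the column names "Name" and "combine_year" come from the text; the tables and the "PTS" column are hypothetical.

```python
import pandas as pd

# Hypothetical tables; only "Name" and "combine_year" come from the text.
best_season = pd.DataFrame({"Name": ["John Doe", "Alex Smith"], "PTS": [1400, 1100]})
combine = pd.DataFrame({"Name": ["John Doe", "Sam Lee"], "combine_year": [2015, 2016]})

# Full (outer) join keeps every player that appears in either table.
merged = best_season.merge(combine, on="Name", how="outer")

# Drop players who have no combine record.
merged = merged.dropna(subset=["combine_year"]).reset_index(drop=True)
print(merged)
```

Players who attended the combine but have no season row stay in the result with missing season stats. If every combine row has a "combine_year", the same rows could be obtained directly by joining the season data onto the combine table.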

### Cleaning Olympic Track and Field data

The Olympic Track and Field dataset was already fairly clean, so I only created a subset containing the high jump events. I then renamed the result column to "Best Height (m)".

[Link to data](https://github.com/anly501/dsan-5000-project-thm12/blob/main/data/01-modified-data/cleaned_high_jump.csv)

In [1]:
import pandas as pd
import numpy as np

#subset track data
olympic_track= pd.read_csv("../../data/00-raw-data/olympic_track.csv")
high_jump = olympic_track[olympic_track["Event"].str.contains("High Jump", case=False, na=False)]
high_jump = high_jump.rename(columns={"Result": "Best Height (m)"})

high_jump.to_csv("../../data/01-modified-data/cleaned_high_jump.csv")



     Gender            Event     Location  Year Medal                  Name  \
1199      M    High Jump Men          Rio  2016     G          Derek DROUIN   
1200      M    High Jump Men          Rio  2016     S    Mutaz Essa BARSHIM   
1201      M    High Jump Men          Rio  2016     B     Bohdan BONDARENKO   
1202      M    High Jump Men      Beijing  2008     G         Andrey SILNOV   
1203      M    High Jump Men      Beijing  2008     S        Germaine MASON   
...     ...              ...          ...   ...   ...                   ...   
2204      W  High Jump Women       London  1948     S          Dorothy ODAM   
2205      W  High Jump Women       London  1948     B  Micheline OSTERMEYER   
2206      W  High Jump Women  Los Angeles  1932     G           Jean SHILEY   
2207      W  High Jump Women  Los Angeles  1932     S     Mildred DIDRIKSON   
2208      W  High Jump Women  Los Angeles  1932     B             Eva DAWES   

     Nationality  Best Height (m)  Unnamed: 8  
119

### Cleaning NFL combine data

The NFL combine data was also already fairly clean, so I only made a few adjustments. I discarded rows that did not have a result for the standing vertical jump. Some prospects opt out of certain events and tests, but I am only interested in the vertical jump for comparison with NBA prospects, so prospects who did not test in that event were removed from the dataset. I then discarded columns that served as dataset IDs and therefore had no significance for analysis. Finally, I renamed the "Vertical" column to "STANDING.VERTICAL" so that it will be easier to join with the NBA combine dataset and analyze in the future.


[Link to code](https://github.com/anly501/dsan-5000-project-thm12/blob/main/codes/01-data-gathering/data_gathering%26cleaning.Rmd)

[Link to data](https://github.com/anly501/dsan-5000-project-thm12/blob/main/data/01-modified-data/cleaned_NFL_combine.csv)
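
These NFL steps were also done in R. A pandas sketch of the same three operations might look like the following. The column names "Vertical" and "STANDING.VERTICAL" come from the text, while "id", "Name", and the rows are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical rows; "id" stands in for the dataset ID columns.
nfl = pd.DataFrame({
    "id": [101, 102, 103],
    "Name": ["John Doe", "Alex Smith", "Sam Lee"],
    "Vertical": [34.5, None, 38.0],
})

nfl_clean = (
    nfl.dropna(subset=["Vertical"])   # remove prospects with no vertical jump result
       .drop(columns=["id"])          # remove identifier-only columns
       .rename(columns={"Vertical": "STANDING.VERTICAL"})
       .reset_index(drop=True)
)
print(nfl_clean)
```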

### Cleaning Stretching Study data

For the stretching data, I renamed the column "Serial\n No." to "Participant number" and then recoded the values in the gender column from 1 and 2 to Male and Female.


[Link to data](https://github.com/anly501/dsan-5000-project-thm12/blob/main/data/01-modified-data/cleaned_stretching.csv)

In [2]:
#clean stretching data
stretching= pd.read_csv("../../data/00-raw-data/stretching.csv")
stretching = stretching.rename(columns={"Serial\n No.": "Participant number"})
stretching["Gender"] = stretching["Gender"].replace({1: "Male", 2: "Female"})

stretching.to_csv("../../data/01-modified-data/cleaned_stretching.csv")


   Participant number  Group  Age  Gender  Height ( Cm )  Weight ( Kg )   BMI  \
0                   1      1   21    Male         172.72             68  22.8   
1                   2      1   23    Male         190.50             98  27.0   
2                   3      1   21    Male         180.30             74  22.8   
3                   4      1   19  Female         162.50             70  27.2   
4                   5      1   20  Female         162.50             52  19.7   

   Vertical jump\n height ( Pre ) ( Cm )  \
0                                     39   
1                                     32   
2                                     49   
3                                     24   
4                                     27   

   Vertical jump\n height ( Post ) ( Cm )  
0                                    37.5  
1                                    35.0  
2                                    48.0  
3                                    24.0  
4                           

# Text cleaning

### Wikipedia API data cleaning


I cleaned the text data by turning the text into a corpus and a list of tokens. I then cleaned the tokens by lemmatizing them and removing digits. I also added written-out numbers to the list of stopwords. Finally, I vectorized the tokens, producing a sparse matrix of token counts as well as a dense array.


In [6]:
#Wikipedia api
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

#import text data
with open('wikipedia_text.txt', 'r') as file:
    text = file.read()

#create tokens and corpus
tokens = word_tokenize(text)
corpus = nltk.tokenize.sent_tokenize(text)


#lemmatization FIX TO ADD WORDS BACK TO SEPARATE LIST
lemmatized_tokens = []
for token in tokens:
    lemmatized_token = lemmatizer.lemmatize(token)
    lemmatized_tokens.append(lemmatized_token)

tokens = lemmatized_tokens

# remove digits FIX TO ADD WORDS BACK TO SEPARATE LIST
def remove_digits(my_string):
    clean_string = ''.join([c for c in my_string if not c.isdigit()])
    return clean_string

no_digit_tokens = []
for token in tokens:
    no_digit_token = remove_digits(token)
    no_digit_tokens.append(no_digit_token)
tokens = no_digit_tokens

#add stopwords
more_stopwords = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"]
more_stopwords.extend(stopwords.words('english'))

#put tokens into vectorizer
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b', lowercase=True, stop_words= more_stopwords)
Xs = vectorizer.fit_transform(tokens)   
print(type(Xs))

# VOCABULARY DICTIONARY
print("vocabulary = ",vectorizer.vocabulary_)   

# col_names
col_names=vectorizer.get_feature_names_out()
print("col_names=",col_names)

print("SPARSE MATRIX\n",Xs)
X=np.array(Xs.todense())
print(X)




<class 'scipy.sparse._csr.csr_matrix'>
vocabulary =  {'basketball': 146, 'team': 1602, 'sport': 1511, 'commonly': 292, 'player': 1199, 'opposing': 1111, 'another': 78, 'rectangular': 1308, 'court': 339, 'compete': 296, 'primary': 1242, 'objective': 1087, 'shooting': 1450, 'approximately': 85, 'inch': 783, 'cm': 268, 'diameter': 413, 'defender': 384, 'hoop': 753, 'basket': 145, 'mounted': 1037, 'foot': 613, 'high': 738, 'backboard': 129, 'end': 500, 'preventing': 1240, 'field': 584, 'goal': 682, 'worth': 1793, 'point': 1204, 'unless': 1698, 'made': 953, 'behind': 152, 'line': 926, 'foul': 627, 'timed': 1638, 'play': 1197, 'stop': 1548, 'fouled': 628, 'designated': 400, 'shoot': 1448, 'technical': 1606, 'given': 675, 'free': 635, 'throw': 1629, 'game': 652, 'win': 1780, 'regulation': 1324, 'expires': 553, 'score': 1403, 'tied': 1634, 'additional': 24, 'period': 1175, 'overtime': 1136, 'mandated': 967, 'players': 1200, 'advance': 27, 'ball': 137, 'bouncing': 183, 'walking': 1748, 'running

### News API cleaning


For this text data, I created a string cleaning function that did a variety of things to the text, including replacing runs of punctuation and trailing spaces with a single space, removing specific characters like quotes, collapsing repeated whitespace characters, and converting all the text to lowercase. I applied the function to the "title" and "description" fields of each article in the list returned by the API. The cleaned text was then stored in the "cleaned_data" list.

In [39]:
#news API

import requests
import json
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

baseURL = "https://newsapi.org/v2/everything?"
total_requests=2
verbose=True

API_KEY='YOUR_NEWSAPI_KEY' #placeholder: substitute your own key and keep it out of version control
TOPIC='Basketball'

#Query
URLpost = {'apiKey': API_KEY,
            'q': '+'+TOPIC,
            'sortBy': 'relevancy',
            'totalRequests': 1}


response = requests.get(baseURL, URLpost) #request data from the server
response = response.json() #extract txt data from request into json


from datetime import datetime
timestamp = datetime.now().strftime("%Y-%m-%d-H%H-M%M-S%S")
 
with open(timestamp+'-newapi-raw-data.json', 'w') as outfile:
    json.dump(response, outfile, indent=4)

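#cleaning function
def string_cleaner(input_string):
    try: 
        #replace runs of punctuation (plus any trailing spaces) with one space
        out=re.sub(r"""
                    [,.;@#?!&$-]+  
                    \ *           
                    """,
                    " ",          
                    input_string, flags=re.VERBOSE)

        #remove apostrophes and periods (applied to out so the step above is not discarded)
        out = re.sub('[’.]+', '', out)

        #collapse repeated whitespace into a single space
        out = re.sub(r'\s+', ' ', out)

        out=out.lower()
    except TypeError:
        #non-string input, e.g. a missing field
        print("ERROR")
        out=''
    return out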

#cleaning

article_list=response['articles']
article_keys=article_list[0].keys()
cleaned_data=[]
for article in article_list:
    tmp=[]
    #keep only the cleaned title and description of each article
    for key in article_keys:
        if key in ('title', 'description'):
            tmp.append(string_cleaner(article[key]))
    cleaned_data.append(tmp)
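
The cleaning steps can be tried without an API key. The sketch below applies a compact variant of the cleaner to a hypothetical list of articles shaped like the "articles" field of the response. The function name `clean_text`, the sample articles, the explicit handling of a missing description, and the final `strip()` are assumptions added for illustration, not part of the code above.

```python
import re

def clean_text(value):
    """Lowercase, strip punctuation and quotes, collapse whitespace."""
    if not isinstance(value, str):  # assumed: a field may be missing (None)
        return ""
    out = re.sub(r"[,.;@#?!&$-]+ *", " ", value)  # punctuation runs -> one space
    out = re.sub(r"[’'\"]+", "", out)             # drop quote characters
    out = re.sub(r"\s+", " ", out).strip()        # collapse whitespace, trim ends
    return out.lower()

# Hypothetical articles; real responses carry more keys per article.
articles = [
    {"title": "Team wins!!!  Fans   cheer", "description": None},
    {"title": "Coach’s plan: defend, rebound", "description": "A close game."},
]

# One [title, description] pair per article, mirroring cleaned_data.
cleaned = [[clean_text(a["title"]), clean_text(a["description"])] for a in articles]
print(cleaned)
```

Checking the type up front avoids printing an error for every article with a missing field and keeps the two-column shape of each row intact.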