![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Data Science Hackathon

As our world becomes increasingly data-driven, the ability to analyze, visualize, and draw insights from large and complex datasets is becoming an essential skill. Data science can provide you with the tools to make better decisions and solve complex problems.

Click on the cell below, then click the Run button above to import and display a dataframe about *hypothetical* pets for adoption from our friends at [Bootstrap](https://www.bootstrapworld.org/materials/data-science).

In [None]:
import pandas as pd
import pyodide
pets = pd.read_csv(pyodide.http.open_url('https://raw.githubusercontent.com/callysto/hackathon/master/PrepMaterials/pets.csv'))
pets

To create visualizations, we will use [Plotly Express](https://plotly.com/python/plotly-express). Click on the cell below, then click Run to display a `bar` graph.

In [None]:
import piplite
await piplite.install(['nbformat','plotly','statsmodels'])
import plotly.express as px
px.bar(pets, x='Name', y='Age (years)', title='Pets Ages')

## Beginner Challenges

Each of these challenges is worth 2 points, and uses the `pets` dataframe.

1. Make a bar graph with `Name` on the x-axis and `Legs` on the y-axis.

2. Make a bar graph using the following code.
```
x='Species', y='Age (years)', color='Gender'
```

3. Recreate the previous bar graph, but add the following code.
```
, barmode='group'
```

4. Make a line graph by changing `bar` to `line` in the example below.
```
px.bar(pets, x='Name', y='Age (years)', title='Pets Ages')
```

5. Make a scatter plot with the code below.
```
px.scatter(pets, x='Name', y='Age (years)')
```

6. Recreate the previous scatter plot, but add a title.

7. Recreate the previous scatter plot and color the points by `Gender`.

8. Make a scatter plot comparing `Age (years)` with `Weight (lbs)`.

9. Recreate the previous scatter plot, but add the following.
```
, size='Time to Adoption (weeks)'
```

10. Make a [histogram](https://plotly.com/python/histograms) (like a bar graph showing calculated statistics) using the following code.
```
px.histogram(pets, x='Species', y='Weight (lbs)', title='Total Weight by Species')
```

11. Recreate the histogram, adding the following parameter to calculate averages instead of totals.
```
, histfunc='avg'
```

12. Sort the `pets` dataframe by `Age (years)` using the following code.
```
pets.sort_values('Age (years)')
```

13. Sort the `pets` dataframe by age in descending order by adding the following.
```
, ascending=False
```

14. Make a bar graph of the sorted dataframe by replacing `pets` in one of the bar graph examples with `pets.sort_values('Age (years)')`.

15. Display just one column of the dataframe using the following.
```
pets['Species']
```

17. Display just the dogs using the following.
```
pets[pets['Species']=='dog']
```

18. Display all the animals that are not dogs by changing the `==` to `!=` in the code above.

19. Filter the `pets` dataframe to show just cats and dogs, using the following.
```
pets[pets['Species'].isin(['cat', 'dog'])]
```

20. Create a pie chart of the grouped data using the following.
```
species_counts = pets.groupby('Species').size()
px.pie(values=species_counts, names=species_counts.index)
```

21. Recreate the previous pie chart and add a title.

22. Find the average ([mean](https://simple.wikipedia.org/wiki/Mean)) of a column with the following.
```
pets['Age (years)'].mean()
```

23. Find the [median](https://simple.wikipedia.org/wiki/Median) value of a different column.

24. Create a new column in the dataframe that is kg instead of pounds using the following.
```
pets['Mass (kg)'] = pets['Weight (lbs)'] / 2.205
```

25. Create a new column that is `Time to Adoption (days)` from the column `Time to Adoption (weeks)`.

26. Group the data by `'Species'` and find the mean values using the following.
```
pets.groupby('Species').mean(numeric_only=True)
```

27. Group the data by `'Fixed'` and print the `.max`, `.min`, and `.sum` values. You will need to do something like the following.
```
print(pets.groupby('Species').sum(numeric_only=True))
```

28. Make an interactive [sunburst chart](https://plotly.com/python/sunburst-charts) using the following.
```
px.sunburst(pets, path=['Species', 'Gender', 'Fixed'], values='Age (years)', title='Pets Ages by Species, Gender, and Fixed')
```

29. Display the row of the heaviest pet using the following.
```
pets.sort_values('Mass (kg)').head(1)
```
To put the heaviest pet first, you may also need `, ascending=False`.

30. Display the row of the pet that took the longest to get adopted.
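
The filtering, new-column, grouping, and sorting steps used in the challenges above can also be tried on a tiny dataframe where the answers are easy to check by hand. The sketch below uses made-up rows (the names and values are assumptions for illustration, not data from `pets.csv`) and only the pandas calls already shown in this section.

```python
import pandas as pd

# Made-up rows for illustration only; these are not from pets.csv.
sample = pd.DataFrame({
    'Name': ['Ada', 'Bo', 'Cy', 'Di'],
    'Species': ['dog', 'cat', 'dog', 'rabbit'],
    'Age (years)': [3, 5, 1, 2],
    'Weight (lbs)': [44.1, 11.025, 22.05, 4.41],
})

# Keep only the rows where the condition is True.
dogs = sample[sample['Species'] == 'dog']

# Convert pounds to kilograms in a new column.
sample['Mass (kg)'] = sample['Weight (lbs)'] / 2.205

# Average the numeric columns within each species.
mean_by_species = sample.groupby('Species').mean(numeric_only=True)

# Sort so the largest mass comes first, then keep that one row.
heaviest = sample.sort_values('Mass (kg)', ascending=False).head(1)

print(dogs)
print(mean_by_species)
print(heaviest)
```

Working on a few rows first makes it easier to see whether a filter or sort did what you expected before running it on the full dataframe.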

## Intermediate Challenges

Each of these challenges is worth 5 points, and uses a large data set about [Pokémon](https://en.wikipedia.org/wiki/Pok%C3%A9mon) from [PokéAPI](https://pokeapi.co).

1. Import and display the Pokémon data using the following.
```
pokemon = pd.read_csv(pyodide.http.open_url('https://raw.githubusercontent.com/callysto/data-files/main/data-science-and-artificial-intelligence/pokemon.csv'))
display(pokemon)
```

2. Display the column labels and descriptions using the following.
```
pokemon_labels = pd.read_csv(pyodide.http.open_url('https://raw.githubusercontent.com/callysto/data-files/main/data-science-and-artificial-intelligence/pokemon_columns.csv'))
pd.set_option('display.min_rows', 50)
display(pokemon_labels)
pd.reset_option('display.min_rows')
```

3. Display the last five rows of the data set using `.tail()` after the dataframe.

4. Make a scatter plot with height on the x-axis and weight on the y-axis.

5. Sort the data by weight and make a bar graph of the 10 lightest Pokémon, with name on the x-axis.

6. Make a scatter plot to see if there is a correlation between attack and defense, and add `, trendline='ols'` to the plot.

7. Recreate the previous scatter plot with `, color='is_legendary'` to see if there is a difference in the slopes of the trendlines.

8. Make a scatter plot with a trendline to see if there is a relationship between `speed` and `base_experience`.

9. Use a visualization to see if capture rate tends to decrease with speed.

10. Make a histogram of total weight by shape.

11. Make a histogram of the **average** attack by generation.

12. Make a histogram with `x=['type1','type2']` to find out which types are most common.

13. Make a histogram with `color` on the x-axis and `base_happiness` on the y-axis (and `histfunc='avg'`) to see if certain colors are happier.

14. Which habitats have the most legendary Pokémon? You'll need to make a histogram with `pokemon[pokemon['is_legendary']]` and `histfunc='count'`.

15. Which are the five most common `evolves_from_species`? You'll need to use `.value_counts('evolves_from_species').head(5)`.
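
Two ideas from the last challenges, filtering with a column of `True`/`False` values and counting how often each value appears, can be sketched on a small made-up dataframe. The rows below are assumptions for illustration and are not taken from the Pokémon data. The sketch calls `.value_counts()` on a single column, which is a closely related form of the dataframe call given in challenge 15.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from pokemon.csv.
sample = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd', 'e', 'f'],
    'is_legendary': [False, True, False, True, False, False],
    'habitat': ['cave', 'rare', 'forest', 'rare', 'cave', 'cave'],
})

# A column of True/False values can be used directly as a filter.
legendary = sample[sample['is_legendary']]

# Count how often each value appears, most common first.
habitat_counts = sample['habitat'].value_counts()

print(legendary)
print(habitat_counts.head(2))
```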

## Advanced Challenges

Each of these challenges is worth 15 points, and uses a large data set about music from [Spotify](https://spotify.com).

Start by loading the data set with
```
music = pd.read_csv(pyodide.http.open_url('https://raw.githubusercontent.com/callysto/data-files/main/hackathon/spotify-top-100-charts.csv'))
music.columns
```
and the column descriptions can be viewed with
```
spotify_labels = pd.read_csv(pyodide.http.open_url('https://raw.githubusercontent.com/callysto/data-files/main/hackathon/spotify-column-descriptions.csv'))
pd.set_option('display.min_rows', 20)
display(spotify_labels)
pd.reset_option('display.min_rows')
```

1. Make a scatter plot with a trendline to show a relationship between two variables (columns) in the data set.

2. Make a scatter plot with two different variables, and then describe the scatter plot using words in the following markdown cell.

The scatter plot...

3. Create a `release_year` column using the code below
```
music['release_year'] = music['release_date'].str.split('-').str[0].astype(int)
```
then make a bar graph from
```
music.groupby('release_year').count()
```
with `'track'` on the y-axis.

4. Use
```
music.sort_values('release_year')
```
to find a row that has an incorrect release date in the year 1900. Then drop that row with
```
music = music.drop(n)
```

where `n` is the row number. Recreate the `music.groupby('release_year').count()` bar graph.

5. Create a [heat map](https://plotly.com/python/heatmaps) of the [correlation matrix](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) using the code below, then choose two variables that are strongly correlated and write about why you think that might be.
```
px.imshow(music.corr(numeric_only=True), height=600, title='Correlations in Spotify Data')
```

I think these variables are strongly correlated because...

6. Use a `for` loop to display multiple graphs, like in the code example below
```
for variable in ['danceability', 'energy', 'loudness']:
    px.scatter(music, x='tempo', y=variable, trendline='ols', title=f'{variable} versus Tempo').show()
```
Then write about what you observe in the graphs.

In the graphs...

7. Identify some [outliers](https://en.wikipedia.org/wiki/Outlier) in the data set.

Some outliers...

8. Create a graph to see if song duration has changed over time in this data set. Remember to include a trendline.

Over time the song duration has...

9. Are there any features that are common to really long, or really short, songs?

10. How many artists have multiple tracks in the data set?

In the data set...

11. Create a new `'release_decade'` column with
```
music['release_decade'] = music['release_year'] // 10 * 10
```
then create a graph to compare the mean (or min) energy grouped by release_decade.

12. Make and describe a new visualization (graph).

The visualization...

13. Analyse data from the [Billions Club](https://open.spotify.com/playlist/37i9dQZF1DX7iB3RCnBnN4) playlist by importing `https://raw.githubusercontent.com/callysto/data-files/main/hackathon/spotify-billions-club.csv`

The "Billions Club" playlist...
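
The `release_year` and `release_decade` calculations from challenges 3 and 11 can be checked on a few made-up rows before applying them to the full data set. The sketch below assumes dates are written as `YYYY-MM-DD` strings; the track names and values are invented for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the Spotify file.
sample = pd.DataFrame({
    'track': ['t1', 't2', 't3', 't4'],
    'release_date': ['1987-05-01', '1994-11-20', '1999-01-15', '2003-07-04'],
    'energy': [0.5, 0.7, 0.9, 0.6],
})

# The text before the first '-' is the year.
sample['release_year'] = sample['release_date'].str.split('-').str[0].astype(int)

# Floor division by 10 drops the last digit; multiplying by 10 puts a zero back.
sample['release_decade'] = sample['release_year'] // 10 * 10

# Average energy within each decade.
mean_energy = sample.groupby('release_decade')['energy'].mean()
print(mean_energy)
```

Here 1994 and 1999 both become 1990, so their energy values are averaged together.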

The Spotify Web API allows us to get information about songs, albums, and artists. If you want to retrieve more data and have a [Spotify account](https://www.spotify.com/us/signup), you can sign in to the [Developers Dashboard](https://developer.spotify.com/dashboard/login). From the Dashboard, you can click the `CREATE AN APP` button, type a name and description, and then click `CREATE`. Clicking on your new app in the Dashboard will show you the `Client ID` and `CLIENT SECRET` that you can paste into the code cell below.

In [None]:
CLIENT_ID = 'PASTE_YOUR_CLIENT_ID_HERE'
CLIENT_SECRET = 'PASTE_YOUR_CLIENT_SECRET_HERE'

import requests
try:
    auth_response = requests.post('https://accounts.spotify.com/api/token', {'grant_type':'client_credentials', 'client_id':CLIENT_ID, 'client_secret':CLIENT_SECRET})
    auth_response_data = auth_response.json()
    access_token = auth_response_data['access_token']
    headers = {'Authorization':'Bearer {token}'.format(token=access_token)}
    print('Spotify API setup complete')
except Exception:
    print('Remember to paste your client ID and secret into the code')

def find_tracks(search_string):
    try:
        r = requests.get('https://api.spotify.com/v1/search?q=' + search_string + '&type=track', headers=headers)
        info = r.json()
    except Exception:
        print('Error with search string:', search_string)
        info = None
    return info

def get_track_info(track_id):
    try:
        r = requests.get('https://api.spotify.com/v1/tracks/' + track_id, headers=headers)
        info = r.json()
    except Exception:
        print('Error with track id:', track_id)
        info = None
    return info

def get_track_features(track_id):
    try:
        r = requests.get('https://api.spotify.com/v1/audio-features/' + track_id, headers=headers)
        info = r.json()
    except Exception:
        print('Error with track id:', track_id)
        info = None
    return info
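
Search strings often contain spaces or characters such as `&`, which have a special meaning inside a URL. One common approach is to percent-encode the query before sending it. The sketch below uses Python's standard `urllib.parse` module; `build_search_url` is a hypothetical helper for illustration and is not part of this notebook.

```python
from urllib.parse import quote, urlencode

def build_search_url(search_string):
    # Hypothetical helper: percent-encode the query so characters such as
    # spaces and '&' are not mistaken for URL syntax.
    query = urlencode({'q': search_string, 'type': 'track'}, quote_via=quote)
    return 'https://api.spotify.com/v1/search?' + query

print(build_search_url('rock & roll'))
```

The encoded URL could then be passed to `requests.get` together with the same `headers` as before.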

14. Get a data set from a playlist, either one you created or someone else's playlist. You need the playlist ID [from the playlist link](https://clients.caster.fm/knowledgebase/110/How-to-find-Spotify-playlist-ID.html).

In [None]:
playlist_id = '37i9dQZF1DX7iB3RCnBnN4'

print('This may take a couple of minutes')
tracks = []
for x in range(50):  
    offset = x*100  # it only returns 100 tracks at a time
    try:
        r = requests.get('https://api.spotify.com/v1/playlists/' + playlist_id + '/tracks?offset=' + str(offset), headers=headers)
        for item in r.json()['items']:
            tracks.append([item['track']['artists'][0]['name'], item['track']['name'], item['track']['id'], item['track']['album']['release_date']])
    except Exception:
        print(offset)
        break
pl = pd.DataFrame(tracks, columns=['artist', 'track', 'id', 'release_date'])

track_features = {}
for row in pl.itertuples():
    print(row[1], row[2]) # artist and track
    track_id = row[3]  # avoid reusing the name of Python's built-in id()
    features = get_track_features(track_id)
    track_features[track_id] = features

from IPython.display import clear_output
clear_output()

tf = pd.DataFrame(track_features).T
playlist = pd.merge(pl, tf, on='id')
playlist
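
The last few lines of the cell above turn a dictionary of feature dictionaries into a dataframe and join it to the track list. A minimal sketch of that step is shown below with made-up values. It assumes each features dictionary includes an `id` key, which the merge on `'id'` relies on; the names `pl_sample`, `features_sample`, `tf_sample`, and `merged` are hypothetical.

```python
import pandas as pd

# Made-up values for illustration only; real ones come from the API calls.
pl_sample = pd.DataFrame(
    [['Artist A', 'Song 1', 'id1'], ['Artist B', 'Song 2', 'id2']],
    columns=['artist', 'track', 'id'],
)
features_sample = {
    'id1': {'id': 'id1', 'tempo': 120.0, 'energy': 0.8},
    'id2': {'id': 'id2', 'tempo': 95.0, 'energy': 0.4},
}

# A dictionary of dictionaries becomes one column per track,
# so .T flips it to one row per track.
tf_sample = pd.DataFrame(features_sample).T

# Rows with the same 'id' in both dataframes are joined side by side.
merged = pd.merge(pl_sample, tf_sample, on='id')
print(merged)
```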

15. Create an interesting visualization using the data you have just retrieved from a playlist.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)