![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Data Science Hackathon

As our world becomes increasingly data-driven, the ability to analyze, visualize, and draw insights from large and complex datasets is becoming an essential skill. Data science can provide you with the tools to make better decisions and solve complex problems.

Click on the cell below, then click the Run button above to import and display a dataframe about *hypothetical* pets for adoption from our friends at [Bootstrap](https://www.bootstrapworld.org/materials/data-science).

In [None]:
import pandas as pd
pets = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/data-science-and-artificial-intelligence/pets.csv')
pets
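
Before plotting, it can help to check how big a dataframe is and what its columns are called. The sketch below uses a small made-up dataframe named `toy_pets` (an assumption for illustration, not the real `pets.csv` data) so that it runs without downloading anything.

```python
import pandas as pd

# Hypothetical sample rows; the real pets dataframe has more rows and columns.
toy_pets = pd.DataFrame({
    'Name': ['Ada', 'Bo', 'Cy'],
    'Species': ['dog', 'cat', 'rabbit'],
    'Age (years)': [3, 7, 1],
})

# .shape is a (rows, columns) tuple
n_rows, n_columns = toy_pets.shape

# .columns holds the column labels used in plotting and filtering code
column_names = list(toy_pets.columns)

# .head(n) returns only the first n rows
first_two = toy_pets.head(2)

print(n_rows, n_columns)
print(column_names)
print(first_two)
```

The same attributes and methods should work on `pets` once it has been loaded.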

To create visualizations, we will use [Plotly Express](https://plotly.com/python/plotly-express). Click on the cell below, then click Run to display a `bar` graph.

In [None]:
import plotly.express as px
px.bar(pets, x='Name', y='Age (years)', title='Pets Ages')

## Beginner Challenges

Each of these challenges is worth 2 points, and uses the `pets` dataframe.

1. make a bar graph with `Name` on the x-axis and `Legs` on the y-axis

2. make a bar graph using
```
x='Species', y='Age (years)', color='Gender'
```

3. recreate the previous bar graph, but add
```
, barmode='group'
```

4. make a line graph by changing `bar` to `line` in the example
```
px.bar(pets, x='Name', y='Age (years)', title='Pets Ages')
```

5. make a scatter plot with
```
px.scatter(pets, x='Name', y='Age (years)')
```

6. recreate the previous scatter plot, but add a title

7. recreate the previous scatter plot and color the points by `Gender`

8. make a scatter plot comparing `Age (years)` with `Weight (lbs)`

9. recreate the previous scatter plot, but add
```
, size='Time to Adoption (weeks)'
```

10. make a [histogram](https://plotly.com/python/histograms) (like a bar graph showing calculated statistics) using
```
px.histogram(pets, x='Species', y='Weight (lbs)', title='Total Weight by Species')
```

11. recreate the histogram, adding the following parameter to calculate averages instead of totals
```
, histfunc='avg'
```

12. sort the `pets` dataframe by `Age (years)`
```
pets.sort_values('Age (years)')
```

13. sort by age in descending order by adding
```
, ascending=False
```

14. make a bar graph of the sorted dataframe

15. display just one column of the dataframe
```
pets['Species']
```

16. display a few columns of the dataframe
```
pets[['Name', 'Legs', 'Time to Adoption (weeks)']]
```

17. display just the dogs
```
pets[pets['Species']=='dog']
```

18. display all the animals that are not dogs
```
pets[pets['Species']!='dog']
```

19. filter the `pets` dataframe to show just cats and dogs, using
```
pets[pets['Species'].isin(['cat', 'dog'])]
```

20. create a pie chart of the number of pets of each species using
```
species_counts = pets.groupby('Species').size()
px.pie(values=species_counts, names=species_counts.index)
```

21. recreate the previous pie chart, and add a title

22. find the average (mean) of a column with
```
pets['Age (years)'].mean()
```

23. find the median value of a different column

24. create a new column in the dataframe that gives each pet's mass in kilograms instead of pounds using
```
pets['Mass (kg)'] = pets['Weight (lbs)'] / 2.205
```

25. create a new column that is `Time to Adoption (days)`

26. group the data by `'Species'` and find the mean values using
```
pets.groupby('Species').mean(numeric_only=True)
```

27. group the data by `'Fixed'` and display the `.max()`, `.min()`, and `.sum()` values

28. make an interactive [sunburst chart](https://plotly.com/python/sunburst-charts) using
```
px.sunburst(pets, path=['Species', 'Gender', 'Fixed'], values='Age (years)', title='Pets Ages by Species, Gender, and Fixed')
```

29. display the row of the heaviest pet using
```
pets.sort_values('Mass (kg)', ascending=False).head(1)
```

30. display the row of the pet that took the longest to get adopted
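
The filtering, grouping, and sorting steps from the challenges above can be chained together. Here is a minimal sketch on a made-up `toy_pets` dataframe (the names and weights are assumptions for illustration only), showing what each intermediate result looks like.

```python
import pandas as pd

# Hypothetical sample rows; the real pets dataframe has more rows and columns.
toy_pets = pd.DataFrame({
    'Name': ['Ada', 'Bo', 'Cy', 'Di', 'Ed'],
    'Species': ['dog', 'cat', 'dog', 'rabbit', 'cat'],
    'Weight (lbs)': [40, 8, 60, 4, 12],
})

# A boolean mask: True where Species is 'cat' or 'dog', False otherwise
is_cat_or_dog = toy_pets['Species'].isin(['cat', 'dog'])

# Indexing with the mask keeps only the True rows
cats_and_dogs = toy_pets[is_cat_or_dog]

# Group the filtered rows by species and average one column
mean_weight = cats_and_dogs.groupby('Species')['Weight (lbs)'].mean()

# Sort from heaviest to lightest and keep the first row
heaviest = cats_and_dogs.sort_values('Weight (lbs)', ascending=False).head(1)

print(mean_weight)
print(heaviest)
```

Saving each step in its own variable makes it easier to check the result of one step before moving on to the next.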

## Intermediate Challenges

Each of these challenges is worth 5 points, and uses a large data set about [Pokémon](https://en.wikipedia.org/wiki/Pok%C3%A9mon) from [PokéAPI](https://pokeapi.co).

1. import and display the Pokémon data using
```
pokemon = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/data-science-and-artificial-intelligence/pokemon.csv')
display(pokemon)
```

2. display the column labels and descriptions using
```
pokemon_labels = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/data-science-and-artificial-intelligence/pokemon_columns.csv')
pd.set_option('display.min_rows', 50)
display(pokemon_labels)
pd.reset_option('display.min_rows')
```

3. display the last five rows of the data set using `.tail()`

4. make a scatter plot with height on the x-axis and weight on the y-axis

5. sort the data by weight and make a bar graph of the 10 heaviest Pokémon, with name on the x-axis

6. make a scatter plot to see if there is a correlation between attack and defense, and add `, trendline='ols'`

7. recreate the previous scatter plot with `, color='is_legendary'` to see if there is a difference in the slopes of the trendlines

8. make a scatter plot with a trendline to see if there is a relationship between speed and base_experience

9. does capture rate tend to decrease with speed?

10. make a histogram of total weight by shape

11. make a histogram of the **average** attack by generation

12. make a histogram with `x=['type1','type2']` to find out which types are most common

13. make a histogram with color on the x-axis and base_happiness on the y-axis (and `histfunc='avg'`) to see if certain colors are happier

14. which habitats have the most legendary Pokémon?

15. which are the five most common `evolves_from_species`? (use `.value_counts().head(5)`)
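
A trendline shows a relationship visually. A correlation coefficient from pandas can back it up with a number, and `.value_counts()` can back up a histogram of categories. The sketch below uses a made-up `toy_pokemon` dataframe; the values and the column names `capture_rate` and `type1` are assumptions for illustration, so check the real column labels before adapting it.

```python
import pandas as pd

# Hypothetical sample rows; the real Pokémon data set is much larger.
toy_pokemon = pd.DataFrame({
    'speed': [10, 20, 30, 40],
    'capture_rate': [200, 150, 100, 50],
    'type1': ['fire', 'water', 'water', 'grass'],
})

# Pearson correlation: near 1 rises together, near -1 one falls as the other rises
r = toy_pokemon['speed'].corr(toy_pokemon['capture_rate'])

# Count how often each value appears, most common first, and keep the top one
top_types = toy_pokemon['type1'].value_counts().head(1)

print(r)
print(top_types)
```

In this toy data the capture rate drops by the same amount for every step in speed, so the correlation is -1. Real data will usually give a value somewhere between -1 and 1.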

## Advanced Challenges

Each of these challenges is worth 15 points, and uses a large data set about music from [Spotify](https://spotify.com).

Start by loading the data set with
```
music = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/hackathon/spotify.csv')
music.columns
```

1. Make a scatter plot with a trendline to show a relationship between two columns (variables) in the data set.

2. Make a scatter plot with two different variables, and then describe it in the following markdown cell.

3. Create a `release_year` column using the code below, and then make a bar graph from `music.groupby('release_year').count()` with `'track'` on the y-axis.
```
music['release_year'] = music['release_date'].str.split('-').str[0]
```
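
To see what the `release_year` line does, here is a minimal sketch on a made-up `toy_music` dataframe. The track names and dates are assumptions for illustration, and the sketch assumes dates are text in the form `YYYY-MM-DD` or just `YYYY`.

```python
import pandas as pd

# Hypothetical sample rows; the real Spotify data set has many more columns.
toy_music = pd.DataFrame({
    'track': ['Song A', 'Song B', 'Song C'],
    'release_date': ['1999-05-01', '2005', '2005-11-20'],
})

# Split each date on '-' and keep the first piece, which is the year as text
toy_music['release_year'] = toy_music['release_date'].str.split('-').str[0]

# Count the non-missing values in each column for every release year
tracks_per_year = toy_music.groupby('release_year').count()

print(tracks_per_year)
```

After grouping, `release_year` becomes the index of the result, and the `track` column holds the number of tracks for each year. The years are still text here; `.astype(int)` could convert them to numbers if needed.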

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)