# Lesson 3: Visualizing Data

Goals and Key Ideas:
1. Purposes and types of data visualizations
    + Determining the correct type of data visualization for a given type of variable
2. Good and bad data visualizations
    + Principles of good data visualizations
3. Creating data visualizations using Python and the `seaborn` library
    + Visualizing the distribution of a variable: **Bar plots** and **Histograms**
    + Visualizing the relationship between two variables: **Scatterplots**
  
Here is the seaborn documentation for the plotting functions used in this lesson:

displot: https://seaborn.pydata.org/generated/seaborn.displot.html

histplot: https://seaborn.pydata.org/generated/seaborn.histplot.html

barplot: https://seaborn.pydata.org/generated/seaborn.barplot.html

countplot: https://seaborn.pydata.org/generated/seaborn.countplot.html

scatterplot: https://seaborn.pydata.org/generated/seaborn.scatterplot.html

More information about matplotlib.pyplot (this is a great tool to organize your figures/plots): https://matplotlib.org/stable/tutorials/pyplot.html


## 3. Creating data visualizations using Python and the `seaborn` library

In [None]:
# import the libraries (pandas, numpy, seaborn, matplotlib.pyplot)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt



## Example with Pets

In [None]:
df = pd.DataFrame({"Name": ["A", "B", "C", "D", "E", "F"],
                   "Species": ["Cat", "Cat", "Dog", "Cat", "Dog", "Rabbit"],
                   "Weight_lb": [25, 15, 100, 20, 20, 4],
                   "Age": [8, 3, 4, 3, 1, 2]})

## 3.1 Bar graph

Bar graphs are used for categorical variables. In this dataset, a bar graph could show `Species` or `Name`. `Species` is the more suitable choice: every animal has a distinct name, so a bar graph of `Name` would show six bars of height 1.

In [None]:
# displot is a general-purpose function for plotting distributions;
# for a categorical variable it draws a bar graph of counts
sns.displot(data = df, x = 'Species')

In [None]:
# The previous plot showed counts; now show proportions (probabilities) instead
sns.displot(data = df, x = 'Species', stat = 'probability')

In [None]:
# countplot draws a bar graph of counts per category (more specialized than displot)
sns.countplot(data = df, x = 'Species')
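
The counts and proportions that these bar graphs display can also be computed directly with pandas, which is a handy way to check a plot. The sketch below rebuilds the relevant columns of the pets table so that it runs on its own; `pets`, `species_counts`, `species_props`, and `name_counts` are illustrative names, not part of the lesson.

```python
import pandas as pd

# Same pets data as in the example, rebuilt so this cell runs on its own
pets = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Species": ["Cat", "Cat", "Dog", "Cat", "Dog", "Rabbit"],
})

# Bar heights when plotting counts
species_counts = pets["Species"].value_counts()

# Bar heights when plotting proportions (stat='probability')
species_props = pets["Species"].value_counts(normalize=True)

# Every name occurs exactly once, so a bar graph of 'Name' would be flat
name_counts = pets["Name"].value_counts()

print(species_counts)
print(species_props)
print(name_counts.max())
```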

## 3.2 Histogram

Histograms are used for continuous variables rather than discrete ones. Here, a histogram is appropriate for showing the number of animals that fall within each range of ages or weights.

For example: how many animals are between the ages of 1 and 3, 3 and 5, 5 and 7, and so on?

In [None]:
# You can use displot or histplot; histplot is more specific than displot
# Specify number of bins (or number of categories)
sns.displot(data=df, x='Age',bins=2)

In [None]:
# Specify number of bins (or number of categories)
sns.displot(data=df, x='Age',bins=4)

In [None]:
# Specify width of bins (or how wide each bar will be)
sns.displot(data=df, x='Age',binwidth=1.5)

In [None]:
# Specify the bin edges (the range covered by each bar) with a list
sns.displot(data=df, x='Age',bins = [1,3,5,7,9])
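
One way to check what a histogram is showing is to count the values in each bin by hand. The sketch below uses `numpy.histogram` with the same bin edges as the last plot. By NumPy's convention, each bin includes its left edge and excludes its right edge, except the last bin, which includes both. The code assumes the plot follows the same convention; `ages`, `counts`, and `edges` are illustrative names.

```python
import numpy as np

# Ages from the pets example
ages = [8, 3, 4, 3, 1, 2]

# Same bin edges as the plot: [1, 3), [3, 5), [5, 7), [7, 9]
counts, edges = np.histogram(ages, bins=[1, 3, 5, 7, 9])

# Print the count for each age range
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"ages {left} to {right}: {count} animal(s)")
```

Note that an animal aged exactly 3 is counted in the 3 to 5 bin, not the 1 to 3 bin.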

## 3.3 Scatterplot

A scatterplot visualizes two numerical variables together, with each observation plotted as a
coordinate point. Here, age and weight are an appropriate choice if we want to find out whether
the two are correlated.

In [None]:
sns.scatterplot(data=df , x='Age', y='Weight_lb')

In [None]:
sns.scatterplot(data=df , x='Age', y='Weight_lb', hue='Species')
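
To put a number on what the scatterplot suggests, one option is the Pearson correlation coefficient, which pandas computes with `Series.corr`. This is a minimal sketch that rebuilds the two columns so it runs on its own; `pets` and `r` are illustrative names.

```python
import pandas as pd

# Age and weight columns from the pets example
pets = pd.DataFrame({
    "Weight_lb": [25, 15, 100, 20, 20, 4],
    "Age": [8, 3, 4, 3, 1, 2],
})

# Pearson correlation (the default method); the result lies between -1 and 1
r = pets["Age"].corr(pets["Weight_lb"])
print(round(r, 2))
```

With only six animals, one of them a 100 lb dog, a single number like this is easily swayed by one point, so it is best read alongside the plot rather than instead of it.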

## More Examples

Here are some more example datasets that we can try to visualize:

In [None]:
# load some example datasets

#1. Top 50 songs on Spotify in 2019
top50 = pd.read_csv('../../../shared/datasets/top50.csv')

#2. 
babyweight = pd.read_csv('../../../shared/datasets/babyweight.csv')

#3. NYC Tree Census
nyctrees = pd.read_csv('../../../shared/datasets/NYC_Tree_Census_small.csv')

## Practice

Grab the Medals dataset from https://www.kaggle.com/datasets/arjunprasadsarkhel/2021-olympics-in-tokyo/data

Here is the documentation on how to read Excel files using pandas: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

1. Visualize the number of medals received by each Team/NOC
2. Display the distribution of medal types for the top 10 teams
3. Display the proportion of teams that received a given number of medals, for example in bins of width 20
4. Determine if there is a relationship between the total number of medals a team wins and the number of gold medals they win.
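
For reading the file, a minimal sketch with `pandas.read_excel` might look like the following. The file name `Medals.xlsx` and its location in the working directory are assumptions, so adjust the path to wherever the download was saved. `load_medals` is a hypothetical helper, not part of pandas, and reading `.xlsx` files also requires the `openpyxl` package to be installed.

```python
from pathlib import Path

import pandas as pd

# Assumed file name and location; change this to match your download
MEDALS_PATH = Path("Medals.xlsx")


def load_medals(path):
    """Read the medals spreadsheet into a DataFrame (hypothetical helper)."""
    return pd.read_excel(path)


if MEDALS_PATH.exists():
    df = load_medals(MEDALS_PATH)
    # Look at the first few rows to confirm the column names
    print(df.head())
else:
    print(f"{MEDALS_PATH} not found; download the dataset first")
```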


In [None]:
# read in data and name it df


In [None]:
# Question 1
# Set up 'area' to plot your figure
plt.figure(figsize=(20,20))
plt.title("Olympic Team and Total Number of Medals Won")

# The plot itself
sns.barplot(data=df, x='Team/NOC', y='Total', hue='Team/NOC', legend=False)

# Beautify
plt.xticks(rotation=90);

# Display the final plot
plt.show()

In [None]:
# Question 2
# Keep the full df intact for Questions 3 and 4; take the first 10 rows separately
# (this assumes the dataset is already sorted by rank)
top10 = df.head(10)

# Select info needed from the larger df
sub_df = top10[['Team/NOC', 'Gold', 'Silver', 'Bronze']]

# The plot itself
# Note here the plotting function is from pandas, which creates its own figure,
# so the size is set with figsize rather than plt.figure
# Stacked bar graphs can show subsets of information - that is,
# out of the total medals, what proportion of them were gold/silver/bronze?
sub_df.plot(kind='bar', x='Team/NOC', stacked=True, color=['gold', 'silver', 'peru'],
            title='Top 10 Teams & Medal Distribution',
            ylabel='Medal Count', figsize=(20, 20));

In [None]:
# Question 3
sns.histplot(data=df, x="Total", stat='percent', bins=[0, 20, 40, 60, 80, 100, 120])

In [None]:
# Question 4
sns.scatterplot(y=df['Total'], x=df['Gold'])

In [None]:
# Question 4, zoomed in: keep only teams with fewer than 15 gold medals
temp=df.query('Gold<15')
sns.scatterplot(y=temp['Total'], x=temp['Gold'])