# Visualizing Data

This notebook introduces data visualization using Pandas and the Seaborn package. Here are a few useful links for both:
    
* Seaborn Gallery of examples - https://seaborn.pydata.org/examples/index.html
* Seaborn User guides and tutorials (great!) - https://seaborn.pydata.org/tutorial.html
* Pandas Visualization tutorial - https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html



## 1. Open and read data into a DataFrame

In [None]:
import pandas as pd
survey = pd.read_csv("smallsurvey.csv")

survey

## 2. Data Cleaning, Filtering, and Wrangling

Let's do some quick cleaning of the data like we did in the Pandas notebook. If you want more details, please refer to the Pandas notebook.

In [None]:

clean_survey = survey.replace("\" vanilla\"", "\"Vanilla\"")
clean_survey["Q1"] = clean_survey["Q1"].str.replace('"', '')
clean_survey["Q2"] = clean_survey["Q2"].str.replace('"', '')
clean_survey["Q3"] = clean_survey["Q3"].str.replace('"', '')
clean_survey["Q4"] = clean_survey["Q4"].str.replace('"', '')
clean_survey['Q2'] = pd.to_numeric(clean_survey['Q2'])

# Add a column called X1 with a list of values (same as before)
clean_survey["X1"] = [1, 1, 2, 1, 4, 2, 2 , 1, 3, 1, 2, 1, 3, 1]

# We are going to have 3 types of ice cream varieties: Vanilla, Chocolate, and Other

# List containing vanilla and chocolate
voc=["Vanilla", "Chocolate"]

# If Q3 is not in the list voc, then set to "Other"
clean_survey.loc[~clean_survey.Q3.isin(voc), 'Q3'] = "Other"

# Check it out
clean_survey
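The cleaning steps above can be checked on a small stand-in table. The sketch below is not the survey itself: the `toy` DataFrame and its values are made up for illustration, and a loop stands in for the repeated `str.replace` lines.

```python
import pandas as pd

# Hypothetical stand-in for smallsurvey.csv: every value is a quoted string
toy = pd.DataFrame({
    "Q1": ['"Right"', '"Left"', '"Right"'],
    "Q2": ['"7"', '"9"', '"8"'],
    "Q3": ['"Vanilla"', '"Mint"', '"Chocolate"'],
})

# Strip the literal double quotes from each text column in one loop
for col in ["Q1", "Q2", "Q3"]:
    toy[col] = toy[col].str.replace('"', '', regex=False)

# Convert Q2 to numbers so it can be plotted
toy["Q2"] = pd.to_numeric(toy["Q2"])

# Anything that is not Vanilla or Chocolate becomes "Other"
toy.loc[~toy["Q3"].isin(["Vanilla", "Chocolate"]), "Q3"] = "Other"

print(toy.dtypes)
print(toy)
```

Printing `dtypes` is a quick way to confirm that `Q2` is numeric before you try to plot it.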

## 3. Data Analysis

We will skip this portion for now. Insert some data analysis here if you want to plot the results.
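As one possibility, a grouped summary gives you a result that is easy to plot later. This is a minimal sketch on made-up data: the `toy` DataFrame is hypothetical, and only its column names match `clean_survey`.

```python
import pandas as pd

# Hypothetical data with the same column names as clean_survey
toy = pd.DataFrame({
    "Q1": ["Right", "Left", "Right", "Left"],
    "Q2": [6, 8, 10, 4],
})

# Mean of Q2 within each Q1 group; the result is a Series indexed by Q1
mean_q2 = toy.groupby("Q1")["Q2"].mean()
print(mean_q2)
```

A Series like `mean_q2` can be passed straight to `plot(kind="bar")`.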

## 4. Data Visualization using Pandas!

Now we get to the cool visualizations. We will start with some easy plots and then show you a few more complex examples. These are just scratching the surface. Use the links at the top of this notebook to see a gallery of examples and possibilities.

In [None]:
# The most basic approach is to call the plot() function on a column
# This plot only works for numeric data (here Q2 or X1)

clean_survey["Q2"].plot()

In [None]:
# We can tell the plot function what 'kind' of plot we want: bar, box, density, hexbin, hist, kde, line, pie, or scatter

# Let's try a box plot using the kind 'box'

clean_survey["Q2"].plot(kind="box")
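The `plot()` call returns a matplotlib Axes object, so you can keep it in a variable and add a title or axis labels. A minimal sketch, assuming a made-up Series `q2` in place of the survey column; the `matplotlib.use("Agg")` line is only there so the example also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Hypothetical values standing in for clean_survey["Q2"]
q2 = pd.Series([6, 8, 10, 4, 7], name="Q2")

# kind="box" is equivalent to calling q2.plot.box()
ax = q2.plot(kind="box")

# The returned Axes lets you label the plot
ax.set_title("Q2 responses")
ax.set_ylabel("Q2")
```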

### You try it!

In [None]:
# Write the code to make a 'hist'ogram plot. Use the code above as inspiration / a guide.



In [None]:
# Let's group some data!
# We can use groupby()

clean_survey.groupby("Q1").boxplot()

In [None]:
# Add some color to our box plot

# Colors are given as hex RGB codes - use this link for more information: https://htmlcolorcodes.com/color-picker

plotcolor = {
    "boxes": "#00ffff",    # Cyan
    "whiskers": "#ff0000", # Red
    "medians": "#0000ff",  # Blue
    "caps": "#555555",     # Gray
}

clean_survey.groupby("Q1").boxplot(color=plotcolor)

In [None]:
# Scatter plot between Q2 and X1

clean_survey.plot.scatter(x="Q2", y="X1")

### You try it!

1. Go to the Pandas visualization guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html
2. Make a plot of your own
3. Try to change the format and/or colors of the plot. Hint: You may need to look at matplotlib style.
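If you are unsure where to start with the hint in step 3, here is one possible approach, sketched on made-up data. The Series `q3`, the color code, and the labels are all assumptions chosen for illustration; `plt.style.context` applies a built-in matplotlib style to this one plot only.

```python
import matplotlib
matplotlib.use("Agg")  # only needed when running outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up answers standing in for the Q3 column
q3 = pd.Series(["Vanilla", "Chocolate", "Other", "Vanilla", "Vanilla", "Chocolate"])

# Count how often each answer appears
counts = q3.value_counts()

# Draw the counts as bars inside a temporary matplotlib style
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    counts.plot(kind="bar", ax=ax, color="#2a9d8f", title="Ice cream preference")
    ax.set_xlabel("Flavor")
    ax.set_ylabel("Number of answers")
```

Passing `ax=ax` draws onto a fresh figure instead of whatever plot happens to be active.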

In [None]:
# Make your awesome visualization here

## 5. Data Visualization using Seaborn!

Now we get to the even cooler visualizations. We will use the Seaborn package, one of several visualization packages available in Python. Others include Bokeh, ggplot, Plotly, folium, and geoplotlib.
As with the Pandas visualizations, we will start with some easy plots and then show you a few more complex examples. These only scratch the surface. Use the links at the top of this notebook to see a gallery of examples and possibilities.

Seaborn divides its visualization functions based on the type of data:
* relplot - statistical relationships
* displot - distributions of data
* catplot - categorical data
* regplot and lmplot - regression models

See more information here: https://seaborn.pydata.org/tutorial.html
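To make the list above concrete, here is a minimal sketch of one of these functions, `relplot`, on made-up data (the `toy` DataFrame is hypothetical). The `kind` argument chooses the plot drawn inside the figure, and the call returns a `FacetGrid` object that you can use to adjust labels.

```python
import matplotlib
matplotlib.use("Agg")  # only needed when running outside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical data with two numeric columns
toy = pd.DataFrame({
    "Q2": [6, 8, 10, 4, 7, 8],
    "X1": [1, 2, 1, 3, 1, 2],
})

# relplot shows relationships; kind can be "scatter" or "line"
g = sns.relplot(data=toy, x="Q2", y="X1", kind="scatter")

# The returned FacetGrid has helpers for labels
g.set_axis_labels("Q2 answer", "X1 value")
```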

In [None]:
# Import the seaborn package
import seaborn as sns

In [None]:
# Bar chart example (simple)

# We will plot categorical data (catplot)
graph = sns.catplot(
    data=clean_survey, kind="bar",
    y="Q2")

In [None]:
# Bar chart example - group by Q1 (left versus right handed)

# We will plot categorical data (catplot)
graph = sns.catplot(
    data=clean_survey, kind="bar",
    y="Q2", x="Q1")

In [None]:
# Strip plot, the catplot default (with jitter so you can see overlapping data and hue so you can see ice cream preference too)

# We will plot categorical data (catplot)
graph = sns.catplot(
    data=clean_survey, jitter=True, hue="Q3",
    y="Q2", x="Q1")
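When no `kind` is given, `catplot` draws a strip plot, which is why the `jitter` argument applies in the cell above. A swarm plot is a separate kind that positions the points so they do not overlap. A minimal sketch, assuming a made-up `toy` DataFrame with the same column names as the survey:

```python
import matplotlib
matplotlib.use("Agg")  # only needed when running outside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical data with the survey's column names
toy = pd.DataFrame({
    "Q1": ["Right", "Left", "Right", "Left", "Right", "Right"],
    "Q2": [6, 8, 10, 4, 7, 8],
    "Q3": ["Vanilla", "Chocolate", "Other", "Vanilla", "Vanilla", "Chocolate"],
})

# kind="swarm" spreads points apart instead of adding random jitter
g = sns.catplot(data=toy, kind="swarm", x="Q1", y="Q2", hue="Q3")
```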

### You try it!

Another type of visualization is called a violin plot. Use the code above and the guide below to create your own violin plot.

Look at the guide here: https://seaborn.pydata.org/tutorial/categorical.html#violinplots

In [None]:
# Create a violin plot here


## 6. Visualization using a bigger dataset

Let's use a bigger dataset: the Titanic passenger data from Seaborn's sample datasets.
This will allow us to try some of the more interesting visualization techniques.

In [None]:

# Load the titanic dataset and save it as a dataframe called tsurvey
tsurvey = sns.load_dataset('titanic')

# Check it out
tsurvey

In [None]:
# Create a binned histogram plot by age

sns.displot(tsurvey, x="age", bins=[5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55])

In [None]:
# Create a binned histogram plot by age
# Divide by sex

sns.displot(tsurvey, x="age", bins=[5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55], hue="sex")

In [None]:
# Let's change the style of this plot

# Choose a context: paper, notebook, talk, or poster
sns.set_context("talk")

# Choose a style: white, dark, whitegrid, darkgrid, or ticks
sns.set_style("white")

sns.displot(tsurvey, x="age", bins=[5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55], hue="sex", kde=True)

In [None]:
# Regression: age and fare.

# Set the context back to paper. Otherwise it is hard to see.
sns.set_context("paper")

sns.regplot(data=tsurvey, x="age", y="fare")

In [None]:
# Regression: age and fare. hue based on sex.

# Set the context back to paper. Otherwise it is hard to see.
sns.set_context("paper")

sns.lmplot(data=tsurvey, x="age", y="fare", hue="sex")

In [None]:
# Regression: age and fare. hue based on sex. divided by class.

# Note: Certainly could be improved. This is just to show you what is possible.

sns.lmplot(data=tsurvey, x="age", y="fare", hue="sex", col="class")
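The `col` argument splits the figure into one panel per category, and the returned `FacetGrid` lets you adjust the panel titles. A minimal sketch on made-up numbers (the `toy` DataFrame is hypothetical, and `ci=None` turns off the confidence band to keep the example fast):

```python
import matplotlib
matplotlib.use("Agg")  # only needed when running outside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical passengers in two classes
toy = pd.DataFrame({
    "age":   [20, 30, 40, 50, 25, 35, 45, 55],
    "fare":  [80, 90, 85, 120, 10, 12, 9, 15],
    "class": ["First"] * 4 + ["Third"] * 4,
})

# One regression panel per value of "class"
g = sns.lmplot(data=toy, x="age", y="fare", col="class", ci=None)

# Replace the default "class = First" style titles
g.set_titles(col_template="{col_name} class")
```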

In [None]:
# Distribution

sns.displot(tsurvey, x="age", y="fare", hue="sex", kind="kde")

In [None]:
# Joint plot

sns.jointplot(
    data=tsurvey,
    x="age", y="fare", hue="sex",
    kind="kde"
)