# Assignment - Data Analysis and Visualization Practice 

This assignment is part of the [Zero to Data Science Bootcamp by Jovian](https://jovian.ai/learn/zero-to-data-analyst-bootcamp).

As you go through this notebook, you will find the symbol **???** in certain places. To complete this assignment, you must replace all the **???** with appropriate values, expressions, or statements to ensure that the notebook runs properly end-to-end. 

**Guidelines**

1. Make sure to run all the code cells in order. Otherwise, you may get errors like `NameError` for undefined variables.
2. Do not change variable names, delete cells, or disturb other existing code. It may cause problems during evaluation.
3. In some cases, you may need to add some code cells or new statements before or after the line of code containing the **???**. 
4. Since you'll be using a temporary online service for code execution, save your work by running `jovian.commit` at regular intervals.
5. Questions marked **(Optional)** will not be considered for evaluation and can be skipped. They are for your learning.


**How to Get Help**

If you are stuck, you can ask for help on the Bootcamp Slack group. Post errors, ask for hints, and help others. Follow these guidelines for getting help:

- Try to spend at least 20 minutes solving a problem before asking for help. But if you've been stuck on a problem for more than 2 hours, please ask for help!
- When you ask a question, make sure to explain what you have tried already and share your code and results if possible. This makes it easy for others to help you faster.
- Try to be as specific as possible when asking a question. E.g. "I can't solve question 3" is too vague. "I can't figure out how to create separate lines for different countries in question 1. Here's what I've tried: ..." is specific enough.
- Help others by resolving their errors, suggesting correct approaches, and sharing relevant resources. **Please don't share complete working solutions on Slack**, so that others get a chance to solve the problem themselves.


Make a submission here: https://jovian.ai/learn/zero-to-data-analyst-bootcamp/assignment/assignment-6-data-analysis-and-visualization-practice

### How to run the code and save your work


**Option 1: Running using free online resources (1-click, recommended):** The easiest way to start executing the code is to click the **Run** button at the top of this page and select **Run on Colab**. [Follow these instructions](https://jovian.ai/docs/user-guide/run.html#run-on-colab) to connect your Google Drive with Jovian.


**Option 2: Running on your computer locally:** To run the code on your computer locally, you'll need to set up [Python](https://www.python.org), download the notebook and install the required libraries. We recommend using the [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) distribution of Python. Click the **Run** button at the top of this page, select the **Run Locally** option, and follow the instructions.

**Saving your work**: You can save a snapshot of the assignment to your [Jovian](https://jovian.ai) profile, so that you can access it later and continue your work. Keep saving your work by running `jovian.commit` from time to time.

In [None]:
!pip install jovian --upgrade --quiet

In [None]:
import jovian

In [None]:
project_name = 'data-analysis-visualization-practice-assignment'

In [None]:
jovian.commit(project=project_name)

[jovian] Updating notebook "ankyhunk-bg4/data-analysis-visualization-practice-assignment" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/ankyhunk-bg4/data-analysis-visualization-practice-assignment


'https://jovian.ai/ankyhunk-bg4/data-analysis-visualization-practice-assignment'

Let's begin by installing and importing the required libraries.

In [None]:
# Restart the kernel after installation
!pip install numpy pandas-profiling matplotlib seaborn plotly folium opendatasets wordcloud --quiet --upgrade

In [None]:
import numpy as np
import pandas as pd
import opendatasets as od
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import folium
import wordcloud

> **QUESTION 1 (Line Chart)**: The Gapminder dataset provides population data from 1952 to 2007 (at 5-year intervals) for several countries around the world. Compare the populations of the European countries France, United Kingdom, Italy, Germany and Spain over this period using a line chart. Make appropriate modifications to the chart title, axis titles, legend, figure size, font size, colors etc. to make the chart readable and visually appealing.
>
> Hints (not all of these may be useful):
>
> - You can use either Matplotlib or Plotly to create this chart
> - To select the data for the given countries, you may find the `isin` method of a Pandas series useful

In [None]:
gapminder_url = 'https://raw.githubusercontent.com/plotly/datasets/master/gapminderDataFiveYear.csv'

In [None]:
gapminder_df = pd.read_csv(gapminder_url)

In [None]:
gapminder_df

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
0,Afghanistan,1952,8425333.0,Asia,28.801,779.445314
1,Afghanistan,1957,9240934.0,Asia,30.332,820.853030
2,Afghanistan,1962,10267083.0,Asia,31.997,853.100710
3,Afghanistan,1967,11537966.0,Asia,34.020,836.197138
4,Afghanistan,1972,13079460.0,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418.0,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340.0,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948.0,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563.0,Africa,39.989,672.038623


In [None]:
countries = ['France', 'United Kingdom', 'Italy', 'Germany', 'Spain']
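
The `isin` hint can be sketched on a toy frame before applying it to `gapminder_df`. This is a minimal illustration only: `toy_df` and its population figures are made up, and the `pivot` step is just one possible way to get one column per country.

```python
import pandas as pd

# Toy stand-in for gapminder_df; the population figures are made up for illustration
toy_df = pd.DataFrame({
    'country': ['France', 'France', 'Spain', 'Spain', 'Japan', 'Japan'],
    'year': [1952, 1957, 1952, 1957, 1952, 1957],
    'pop': [10.0, 11.0, 20.0, 21.0, 30.0, 31.0],
})

selected = ['France', 'Spain']

# isin builds a boolean mask that is True for rows whose country is in the list
mask = toy_df['country'].isin(selected)
filtered = toy_df[mask]

# Reshape to one column per country, indexed by year, ready for a line chart
wide = filtered.pivot(index='year', columns='country', values='pop')
print(wide)
```

A single `isin` mask selects all listed countries at once, so no per-country loop is needed for the filtering step.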

In [None]:
pop_df = pd.DataFrame({'year': gapminder_df['year'].unique()})

# Add one population column per country (rows for each country are already ordered by year)
for country in countries:
    pop_df[country] = gapminder_df.loc[gapminder_df['country'] == country, 'pop'].to_list()

# Use the year as the index so that it becomes the x-axis of the line chart
pop_df = pop_df.set_index('year')


In [None]:
pop_df

year,France,United Kingdom,Italy,Germany,Spain
1952,42459667.0,50430000.0,47666000.0,69145952.0,28549870.0
1957,44310863.0,51430000.0,49182000.0,71019069.0,29841614.0
1962,47124000.0,53292000.0,50843200.0,73739117.0,31158061.0
1967,49569000.0,54959000.0,52667100.0,76368453.0,32850275.0
1972,51732000.0,56079000.0,54365564.0,78717088.0,34513161.0
1977,53165019.0,56179000.0,56059245.0,78160773.0,36439000.0
1982,54433565.0,56339704.0,56535636.0,78335266.0,37983310.0
1987,55630100.0,56981620.0,56729703.0,77718298.0,38880702.0
1992,57374179.0,57866349.0,56840847.0,80597764.0,39549438.0
1997,58623428.0,58808266.0,57479469.0,82011073.0,39855442.0


In [None]:
fig=px.line(pop_df,title='Population')

In [None]:
fig.update_layout(
    title="Population data from 1952 to 2007",
    xaxis_title="Year",
    yaxis_title="Population",
    legend_title="Country",
    plot_bgcolor="#ffcc9c",
    font=dict(
        family="Arial",
        size=14,
        color="#cc3e0e"
    )
)
fig.update_yaxes(rangemode="tozero")
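
The population values in `pop_df` are raw head counts. If the y-axis is meant to read in millions, the values can be scaled before plotting. A minimal sketch, using the first two France values from the table above; `toy_pop` and `pop_millions` are illustrative names, not part of the assignment.

```python
import pandas as pd

# Two France values copied from the pop_df output above
toy_pop = pd.DataFrame(
    {'France': [42459667.0, 44310863.0]},
    index=pd.Index([1952, 1957], name='year'),
)

# Dividing by one million converts head counts to millions
pop_millions = toy_pop / 1e6
print(pop_millions.round(2))
```

Passing the scaled frame to `px.line` instead of the original would make an axis title such as "Population (in millions)" match the plotted numbers.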

Let's save our work before continuing.

In [None]:
jovian.commit()


> **QUESTION 2 (Scatter Plot)**: `diamonds_url` points to a CSV file containing various attributes like carat, cut, color, clarity, price etc. for over 53,000 diamonds. Visualize the relationship between the carat (size of diamond) and price using a scatter plot. Instead of using the entire dataset for this visualization, just pick the diamonds with clarity `"SI2"` and color `"E"`. Use the values of the "cut" column to color the dots in the scatter plot. Make appropriate modifications to the chart title, axis titles, legend, figure size, font size, colors etc. to make the chart readable and visually appealing.
>
> Hints (not all of these may be useful):
> - You can use Seaborn or Plotly to create the scatter plot for this dataset
> - Check [this stackoverflow answer](https://stackoverflow.com/questions/22591174/pandas-multiple-conditions-while-indexing-data-frame-unexpected-behavior) for selecting data frame rows using multiple conditions.



In [None]:
diamonds_url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv'

In [None]:
diamonds_df = pd.read_csv(diamonds_url)

In [None]:
diamonds_df

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


In [None]:
plot_df=diamonds_df[(diamonds_df.clarity=='SI2')&(diamonds_df.color=='E')]
plot_df.shape

(1713, 10)
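
When combining conditions with `&`, each comparison needs its own parentheses, because `&` binds more tightly than `==` in Python. The sketch below uses the first five rows shown in the `diamonds_df` output above and compares the boolean-mask form with `DataFrame.query`, which is an alternative spelling of the same filter; `toy` is an illustrative name.

```python
import pandas as pd

# First five rows of diamonds_df, restricted to the columns used here
toy = pd.DataFrame({
    'carat': [0.23, 0.21, 0.23, 0.29, 0.31],
    'cut': ['Ideal', 'Premium', 'Good', 'Premium', 'Good'],
    'color': ['E', 'E', 'E', 'I', 'J'],
    'clarity': ['SI2', 'SI1', 'VS1', 'VS2', 'SI2'],
    'price': [326, 326, 327, 334, 335],
})

# Boolean masks: parentheses around each comparison are required
mask_version = toy[(toy['clarity'] == 'SI2') & (toy['color'] == 'E')]

# The same filter written as a query string
query_version = toy.query("clarity == 'SI2' and color == 'E'")

print(mask_version)
```

Both forms return the same single row here (the "Ideal" cut diamond), so the choice between them is mostly a matter of readability.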

In [None]:
fig=px.scatter(plot_df,
           title='Relationship between the carat (size of diamond) and price',
           x='carat',
           y='price',
           opacity=0.5,
           color='cut'
           )
fig.update_layout(xaxis_title='Diamond carat (size)',
                  yaxis_title='Price ($)',
                  legend_title='Diamond cut')

Let's save our work before continuing.

In [None]:
jovian.commit()


> **QUESTION 3 (Histogram and Box Plot):** The Planets dataset contains details about the 1,000+ extrasolar planets discovered up to 2014. Visualize the distribution of the masses of the planets (expressed as a multiple of the mass of Jupiter), using a histogram and a box plot. Make appropriate modifications to the chart title, axis titles, legend, figure size, font size, colors etc. to make the chart readable and visually appealing.
> 
> Hints:
>
> - You can use Matplotlib, Seaborn or Plotly to create these plots
> - If you're using Plotly, you can show both charts together (use the `marginal` argument of `px.histogram`)

In [None]:
planets_csv = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/planets.csv'

In [None]:
planets_df = pd.read_csv(planets_csv)

In [None]:
planets_df

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.300000,7.10,77.40,2006
1,Radial Velocity,1,874.774000,2.21,56.95,2008
2,Radial Velocity,1,763.000000,2.60,19.84,2011
3,Radial Velocity,1,326.030000,19.40,110.62,2007
4,Radial Velocity,1,516.220000,10.50,119.47,2009
...,...,...,...,...,...,...
1030,Transit,1,3.941507,,172.00,2006
1031,Transit,1,2.615864,,148.00,2007
1032,Transit,1,3.191524,,174.00,2007
1033,Transit,1,4.125083,,293.00,2008


In [None]:
fig=px.histogram(planets_df,
                 title='Distribution of the masses of the extrasolar planets',
                 x="mass",
                 marginal="box",
                 hover_data=planets_df.columns)
fig.update_layout(xaxis_title='Mass (multiples of Jupiter)',
                  yaxis_title='Number of planets')
fig.show()

> **(Optional) Question** Answer the following questions:
> 
> - Is the distribution of planet mass exponential or Gaussian?
> - What is the median exoplanet mass? How does it compare to the maximum?
> - How does the mass of a planet compare with its orbital period? Visualize their relationship using a scatter plot.

In [None]:
# 1) The distribution looks exponential rather than Gaussian

In [None]:
print(planets_df.mass.median())
print(planets_df.mass.max())

1.26
25.0
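
To put the two numbers above side by side, the maximum can be expressed as a multiple of the median. The sketch also shows that `Series.median` skips missing values by default, which matters because many rows of `planets_df` have no mass recorded. The small series uses the five masses visible in the table above plus one missing value; `masses` and `ratio` are illustrative names.

```python
import pandas as pd

# Five masses from the first rows of planets_df, plus one missing value
masses = pd.Series([7.10, 2.21, 2.60, 19.40, 10.50, float('nan')])

# median() ignores NaN by default; count() reports the non-missing entries
print(masses.median(), masses.count())

# Ratio of the maximum to the median, using the values printed for the full dataset
median_mass = 1.26
max_mass = 25.0
ratio = max_mass / median_mass
print(f"The heaviest planet is about {ratio:.1f} times the median mass")
```

A maximum roughly twenty times the median is consistent with a long right tail in the histogram.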


In [None]:
px.scatter(planets_df,
           title='Comparison of mass of a planet with its orbital period',
           x="mass", 
           y="orbital_period", 
           color="distance"
          )

Let's save our work before continuing.

In [None]:
jovian.commit()


> **QUESTION 4 (Bar Chart):** The Job Automation Probability dataset, created during a [Future of Employment study from 2013](https://www.oxfordmartin.ox.ac.uk/downloads/academic/The_Future_of_Employment.pdf), estimates the probability of different jobs being automated in the 21st century due to computerization. Create a bar chart to show the 25 jobs requiring a "Bachelor's degree" (and no higher qualification) that are most likely to be automated. Make appropriate modifications to the chart title, axis titles, legend, figure size, font size, colors etc. to make the chart readable and visually appealing.

In [None]:
job_automation_url = 'https://raw.githubusercontent.com/plotly/datasets/master/job-automation-probability.csv'

In [None]:
job_automation_df = pd.read_csv(job_automation_url)

In [None]:
job_automation_df

Unnamed: 0,_ - rank,_ - code,prob,Average annual wage,education,occupation,short occupation,len,probability,numbEmployed,median_ann_wage,employed_may2016,average_ann_wage
0,624,51-4033,0.9500,34920.0,High school diploma or equivalent,"Grinding, Lapping, Polishing and Buffing Machi...","Tool setters, operators and tenders",35,0.9500,74600,32890.0,74600,34920.0
1,517,51-9012,0.8800,41450.0,High school diploma or equivalent,"Separating, Filtering, Clarifying, Precipitati...","Tool setters, operators and tenders",35,0.8800,47160,38360.0,47160,41450.0
2,484,41-4012,0.8500,68410.0,High school diploma or equivalent,"Sales Representatives, Wholesale and Manufactu...","Sales Representatives, Wholesale and Manufactu...",92,0.8500,1404050,57140.0,1404050,68410.0
3,105,53-1031,0.0290,59800.0,High school diploma or equivalent,First-Line Supervisors of Transportation and M...,Supervisors Transportation,26,0.0290,202760,57270.0,202760,59800.0
4,620,51-4072,0.9500,32660.0,High school diploma or equivalent,"Molding, Coremaking and Casting Machine Setter...","Molding, Coremaking and Casting Machine Setter...",89,0.9500,145560,30480.0,145560,32660.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
697,259,27-2011,0.3700,71313.6,"Some college, no degree",Actors,Actors,6,0.3700,48620,33473.0,48620,71313.6
698,522,51-3011,0.8900,27110.0,No formal educational credential,Bakers,Bakers,6,0.8900,180450,25090.0,180450,27110.0
699,42,21-2011,0.0081,49450.0,Bachelor's degree,Clergy,Clergy,6,0.0081,49320,45740.0,49320,49450.0
700,669,41-9012,0.9800,36560.0,No formal educational credential,Models,Models,6,0.9800,4390,21870.0,4390,36560.0


In [None]:
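# Keep the Bachelor's-degree jobs, then take the 25 with the highest automation probability
top_bachelor_jobs_df = (job_automation_df[job_automation_df.education == "Bachelor's degree"]
                        .sort_values(by='probability', ascending=False)
                        .head(25))

fig = px.bar(top_bachelor_jobs_df,
             x="probability",
             y="occupation",
             title="25 Bachelor's-degree jobs most likely to be automated")
fig.update_layout(xaxis_title='Probability of automation',
                  yaxis_title='Occupation')
fig.show()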
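
The "filter, sort, take the top N" step can be checked on a small example. The frame below is made up for illustration (occupations `A` to `D` and their probabilities are not from the dataset). It also shows `DataFrame.nlargest`, an alternative to sorting followed by `head`.

```python
import pandas as pd

# Made-up rows; only the column names mirror job_automation_df
jobs = pd.DataFrame({
    'occupation': ['A', 'B', 'C', 'D'],
    'education': ["Bachelor's degree", "Bachelor's degree",
                  "High school diploma or equivalent", "Bachelor's degree"],
    'probability': [0.10, 0.90, 0.95, 0.50],
})

bachelors = jobs[jobs['education'] == "Bachelor's degree"]

# Sort descending and keep the first two rows
top2_sorted = bachelors.sort_values(by='probability', ascending=False).head(2)

# nlargest returns the same rows, already ordered from highest to lowest
top2_nlargest = bachelors.nlargest(2, 'probability')

print(top2_nlargest)
```

Occupation `C` has the highest probability overall but is excluded by the education filter, which is why the filter has to come before the ranking.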

> **(Optional) Question:** What other insights can you derive from the above data? Create some more visualizations and summarize your insights below.

Let's save our work before continuing.

In [None]:
jovian.commit()


> **QUESTION 5 (Geographical Map):** `nuclear_waste_url` points to a CSV file containing the locations of several nuclear waste storage sites in the United States. Show these sites as markers on a map of the United States. Clicking on a marker should display the name of the site. Pick an appropriate location, zoom level and image tiles for the map.

In [None]:
nuclear_waste_url = 'https://raw.githubusercontent.com/plotly/datasets/master/Nuclear%20Waste%20Sites%20on%20American%20Campuses.csv'

In [None]:
nuclear_waste_df = pd.read_csv(nuclear_waste_url)

In [None]:
nuclear_waste_df

Unnamed: 0,lat,lon,text
0,35.888827,-106.305022,Acid/Pueblo Canyon
1,39.503487,-84.743859,Alba Craft Shop
2,44.620822,-123.120917,"""Albany, Oregon, FUSRAP Site"""
3,40.641371,-80.242936,Aliquippa Forge
4,39.361063,-84.54075,Associated Aircraft Tool and Manufacturing Co.
5,39.957354,-83.011455,B & T Metals
6,41.672189,-83.568625,Baker Brothers
7,40.746102,-74.006642,Baker and Williams Warehouses
8,35.899725,-106.290127,"""Bayo Canyon, New Mexico, FUSRAP Site"""
9,42.841147,-78.83406,Bliss and Laughlin Steel


In [None]:
# Center the map on the first site in the dataset
m = folium.Map(location=[nuclear_waste_df.lat[0], nuclear_waste_df.lon[0]],
               zoom_start=5,
               tiles='Stamen Terrain')

tooltip = "Click me!"
# Add one marker per site; clicking a marker opens a popup with the site name
for site in nuclear_waste_df.itertuples():
    folium.Marker([site.lat, site.lon], popup=site.text, tooltip=tooltip).add_to(m)
m
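
One possible way to pick the map location is to center on the average of the site coordinates instead of a single site. This is only a sketch of that idea, using the first four rows shown in the table above; `sites` and `center` are illustrative names.

```python
import pandas as pd

# First four sites from the nuclear_waste_df output above
sites = pd.DataFrame({
    'lat': [35.888827, 39.503487, 44.620822, 40.641371],
    'lon': [-106.305022, -84.743859, -123.120917, -80.242936],
    'text': ['Acid/Pueblo Canyon', 'Alba Craft Shop',
             'Albany, Oregon, FUSRAP Site', 'Aliquippa Forge'],
})

# Mean latitude and longitude give a rough center for the markers
center = [sites['lat'].mean(), sites['lon'].mean()]
print(center)

# `center` could then be passed as the `location` argument of folium.Map
```

With the full dataset the same two `mean()` calls would apply to `nuclear_waste_df`; whether the result is a better starting view than the first site depends on how the sites are spread.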

Let's save our work before continuing.

In [None]:
jovian.commit()


### Make a Submission

You can make a submission by executing the following cell, or by providing the link to your Jovian notebook on this page: https://jovian.ai/learn/zero-to-data-analyst-bootcamp/assignment/assignment-6-data-analysis-and-visualization-practice

In [None]:
jovian.submit('zerotodatascience-dataviz')

[jovian] Updating notebook "ankyhunk-bg4/data-analysis-visualization-practice-assignment" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/ankyhunk-bg4/data-analysis-visualization-practice-assignment
[jovian] Submitting assignment..
[jovian] Verify your submission at https://jovian.ai/learn/zero-to-data-analyst-bootcamp/assignment/data-analysis-and-visualization-practice


You can submit any number of times. Only your final submission will be considered for evaluation.

The remaining questions in this assignment are optional.

> **(OPTIONAL) QUESTION 7:** The following dataset contains information about some police deaths in the US from 1984 to 2016. Create the following visualizations using this dataset.
>
> 1. Bar chart showing the total deaths per year for 1984-2016
> 2. Line chart comparing the yearly deaths in different states
> 3. A heat map overlaid on the map of the United States 
> 4. A Choropleth comparing the number of police deaths in different states
> 5. A word cloud of the different causes of death (remove the string "Cause of Death:" for best results)
> 6. A [marker cluster](https://georgetsilva.github.io/posts/mapping-points-with-folium/) showing the shootings in the state of California (show person's name on hover)
> 7. Bar chart comparing the no. of deaths due to different causes, animated by year
> 8. A heatmap of total deaths per state per year i.e. showing "state" on one axis and "year" on the other axis.
> 9. A treemap with three levels: state, city (description), cause
> 10. A rug plot showing a timeline of canine deaths.

In [None]:
police_url = 'https://raw.githubusercontent.com/plotly/datasets/master/US-shooting-incidents.csv'
police_df = pd.read_csv(police_url)

In [None]:
police_df

Unnamed: 0.1,Unnamed: 0,person,dept,eow,cause,cause_short,date,year,canine,dept_name,state,description,latitude,longitude,state_name
0,1,K9 Roscoe,"Phoenix Police Department, AZ","EOW: Friday, July 13, 1984",Cause of Death: Struck by vehicle,Struck by vehicle,1984-07-13,1984,True,Phoenix Police Department,AZ,Phoenix,33.448143,-112.096962,Arizona
1,2,"Police Officer Roy L. Leon, Jr.","Cotton Plant Police Department, AR","EOW: Friday, July 13, 1984",Cause of Death: Gunfire,Gunfire,1984-07-13,1984,False,Cotton Plant Police Department,AR,Little Rock,34.746613,-92.288986,Arkansas
2,3,Officer Stanley D. Pounds,"Portland Police Bureau, OR","EOW: Wednesday, July 18, 1984",Cause of Death: Automobile accident,Automobile accident,1984-07-18,1984,False,Portland Police Bureau,OR,Salem,44.938461,-123.030403,Oregon
3,4,"Enforcement Agent Ernest Joseph Gray, Jr.","Pennsylvania Public Utility Commission, PA","EOW: Friday, July 20, 1984",Cause of Death: Automobile accident,Automobile accident,1984-07-20,1984,False,Pennsylvania Public Utility Commission,PA,Harrisburg,40.264378,-76.883598,Pennsylvania
4,5,"Police Officer James W. Carozza, Jr.","Greenburgh Police Department, NY","EOW: Friday, July 20, 1984",Cause of Death: Vehicle pursuit,Vehicle pursuit,1984-07-20,1984,False,Greenburgh Police Department,NY,Albany,42.652843,-73.757874,New York
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4994,4995,"Deputy Sheriff David Francis Michel, Jr.","Jefferson Parish Sheriff's Office, LA","EOW: Wednesday, June 22, 2016",Cause of Death: Gunfire,Gunfire,2016-06-22,2016,False,Jefferson Parish Sheriff's Office,LA,Baton Rouge,30.457069,-91.187393,Louisiana
4995,4996,K9 Tyson,"Fountain County Sheriff's Office, IN","EOW: Monday, June 27, 2016",Cause of Death: Heat exhaustion,Heat exhaustion,2016-06-27,2016,True,Fountain County Sheriff's Office,IN,Indianapolis,39.768623,-86.162643,Indiana
4996,4997,K9 Credo,"Long Beach Police Department, CA","EOW: Tuesday, June 28, 2016",Cause of Death: Gunfire (Accidental),Gunfire (Accidental),2016-06-28,2016,True,Long Beach Police Department,CA,Sacramento,38.576668,-121.493629,California
4997,4998,"Deputy Sheriff Martin Tase Sturgill, II","Humphreys County Sheriff's Office, TN","EOW: Thursday, June 30, 2016",Cause of Death: Heart attack,Heart attack,2016-06-30,2016,False,Humphreys County Sheriff's Office,TN,Nashville,36.165810,-86.784241,Tennessee
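
Two of the preparation steps for the visualizations listed above can be sketched on a few rows: stripping the "Cause of Death:" prefix (for the word cloud) and counting deaths per year (for the bar chart). The rows below are copied from the `police_df` output above; `toy_police` is an illustrative name, and the full dataset already provides a cleaned `cause_short` column that could be used instead.

```python
import pandas as pd

# A few rows from police_df, restricted to the columns used here
toy_police = pd.DataFrame({
    'cause': ['Cause of Death: Struck by vehicle',
              'Cause of Death: Gunfire',
              'Cause of Death: Automobile accident',
              'Cause of Death: Gunfire',
              'Cause of Death: Heat exhaustion'],
    'year': [1984, 1984, 1984, 2016, 2016],
})

# Remove the fixed prefix; regex=False treats the pattern as a plain string
cleaned = toy_police['cause'].str.replace('Cause of Death: ', '', regex=False)
print(cleaned.tolist())

# Count rows per year and order the result chronologically
deaths_per_year = toy_police['year'].value_counts().sort_index()
print(deaths_per_year)
```

The resulting `deaths_per_year` series has the year as its index, so it can be passed to a bar chart directly.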


Let's save our work before continuing.

In [None]:
jovian.commit()


## Acknowledgement

The datasets in this assignment are taken from the following sources:

* Seaborn datasets: https://github.com/mwaskom/seaborn-data
* Plotly datasets: https://github.com/plotly/datasets
* Five Thirty Eight datasets: https://github.com/fivethirtyeight/data

Check these links for many more datasets that can be used for data analysis and visualization.