
# **Lab 1: Part I - Review of Data Science**
---
### **Description**
This lab reviews exploratory data analysis (EDA) with Python's pandas library and data visualization with the matplotlib library. Throughout the notebook, you'll review how to load and manipulate datasets with pandas commands and how to use matplotlib to create visualizations that reveal patterns and trends in the data.

<br>

### **Lab Structure**
**Part 1**: [Exploratory Data Analysis Review](#p1)

  >  **Part 1.1**: [Basic Commands](#p1.1)

  >  **Part 1.2**: [Further Exploration](#p1.2)

**Part 2**: [Data Visualization Review](#p2)

  >  **Part 2.1**: [Scatter Plots](#p2.1)

  >  **Part 2.2**: [Line Plots](#p2.2)

  >  **Part 2.3**: [Bar Plots](#p2.3)

**Part 3**: [[OPTIONAL] Improving Visualizations](#p3)

  >  **Part 3.1**: [Improving Scatter Plots](#p3.1)

  >  **Part 3.2**: [Improving Line Plots](#p3.2)

  >  **Part 3.3**: [Improving Bar Plots](#p3.3)

  >  **Part 3.4**: [Enhancing Plot Aesthetics](#p3.4)



<br>

### **Learning Objectives**
 By the end of this lab, we will:
* Understand basic pandas commands for EDA.

* Understand basic matplotlib commands for Data Visualization.


<br>


### **Resources**
* [EDA with pandas Cheat Sheet](https://docs.google.com/document/d/1FFoqw45P-kuoq912ARP4qfdGeLTqoq73_qjZThPp2_8/edit?usp=drive_link)

* [Data Visualization with matplotlib Cheat Sheet](https://docs.google.com/document/d/1YlUp6ll81qOyDpU1OWzE-SPxQ3hnF5C9ukLRL_6PYKE/edit?usp=drive_link)


<br>

**Before starting, run the code below to import all necessary functions and libraries.**


In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np  # needed for np.arange in Part 3.1
import pandas as pd
import matplotlib.pyplot as plt

<a name="p1"></a>

---
## **Part 1: Exploratory Data Analysis Review**
---




<a name="p1.1"></a>

---
### **Part 1.1: Basic Commands**
---


**Run the code cell below to create the DataFrame.**

In [None]:
df = pd.DataFrame({'U.S. State': ['California', 'Florida', 'Indiana', 'Texas', 'Pennsylvania'],
        'Population (in millions)': [38, 21, 6.5, 28, 13],
        'Capital': ['Sacramento', 'Tallahassee', 'Indianapolis', 'Austin', 'Harrisburg'],
        'GDP ($ in billions)': [3700, 1070, 352, 1876, 726]})

#### **Problem #1.1.1**

**Together**, let's inspect what `.head()` tells us about this DataFrame.

#### **Problem #1.1.2**

**Together**, let's determine what datatype `Population (in millions)` is.

#### **Problem #1.1.3**

**Together**, let's print all of the unique values for `GDP ($ in billions)`.
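
As a quick refresher, the three commands used in the problems above can be sketched on a small made-up table. The DataFrame `toy_df` and its columns are hypothetical and are not part of the lab's data.

```python
import pandas as pd

# Hypothetical toy data, not the lab's DataFrame.
toy_df = pd.DataFrame({
    'Fruit': ['apple', 'banana', 'cherry', 'banana'],
    'Price (USD)': [0.5, 0.25, 3.0, 0.25],
})

# .head() shows the first rows (up to 5 by default).
print(toy_df.head())

# .dtype reports the datatype of a single column.
price_dtype = toy_df['Price (USD)'].dtype
print(price_dtype)

# .unique() returns the distinct values in order of first appearance.
unique_prices = toy_df['Price (USD)'].unique()
print(unique_prices)
```

The same calls should work on any DataFrame column once the column name is swapped in.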

---

#### **Now it's your turn! Try Problems #1.1.4 - 1.1.7 on your own.**

---

#### **Problem #1.1.4**

**Independently**, determine the column names in the dataset.

#### **Problem #1.1.5**

**Independently**, determine the highest `GDP ($ in billions)` in the dataset.

#### **Problem #1.1.6**

**Independently**, determine which states are included in this dataset.

#### **Problem #1.1.7**

**Independently**, determine the range of GDP values among the states.

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="p1.2"></a>

---
### **Part 1.2: Further Exploration**
---



#### **Problem #1.2.1**

**Independently**, determine the average `Population (in millions)` size among the U.S. states in the dataset.

#### **Problem #1.2.2**

**Independently,** explore rows 4 and 5. What are the U.S. States listed?

#### **Problem #1.2.3**

**Independently**, determine the total `Population (in millions)` across all states.

#### **Problem #1.2.4**

**Independently**, determine the average `Population (in millions)` of the states.

#### **Problem #1.2.5**

**Independently**, determine the `Population (in millions)` for the 3rd state in the dataset.

#### **Problem #1.2.6**

**Independently**, determine how many states have a population greater than 20 million.

#### **Problem #1.2.7**

**Independently**, explore the last row in the dataset.
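
The problems in this part rely on a handful of pandas tools: column aggregates, positional selection with `.iloc`, and boolean comparisons. Here is a minimal sketch of each on made-up data. The DataFrame `toy_df`, its columns, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the lab's DataFrame.
toy_df = pd.DataFrame({
    'City': ['Avon', 'Brook', 'Cedar', 'Dale'],
    'Residents (thousands)': [12, 45, 30, 8],
})

# Column aggregates.
total = toy_df['Residents (thousands)'].sum()
average = toy_df['Residents (thousands)'].mean()

# .iloc selects rows by integer position, counting from 0.
third_row = toy_df.iloc[2]
last_row = toy_df.iloc[-1]
rows_two_and_three = toy_df.iloc[1:3]  # positions 1 and 2; the end is excluded

# A comparison gives a True/False Series; summing it counts the True values.
n_large = (toy_df['Residents (thousands)'] > 20).sum()

print(total, average, n_large)
```

Note that `.iloc` counts from 0, so "the 3rd row" is position 2.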

#### **[Challenge Question] Problem #1.2.8**

**Independently**, determine the average `GDP per capita` for the states.

**HINT:** Divide `GDP ($ in billions)` by `Population (in millions)`.
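
When dividing a GDP column measured in billions of dollars by a population column measured in millions, the units matter: billions divided by millions gives thousands of dollars per person. The sketch below works through this on two made-up regions. The column names mirror the lab's, but the regions and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical figures for two made-up regions.
toy_df = pd.DataFrame({
    'Region': ['A', 'B'],
    'GDP ($ in billions)': [500.0, 120.0],
    'Population (in millions)': [10.0, 4.0],
})

# billions / millions = thousands, so this ratio is in thousands of USD per person.
per_capita_thousands = toy_df['GDP ($ in billions)'] / toy_df['Population (in millions)']

# Multiply by 1,000 to express the result in USD per person.
per_capita_usd = per_capita_thousands * 1000

# Average of the per-region values.
print(per_capita_usd.mean())
```

This averages the per-region ratios. Dividing total GDP by total population is a different quantity and will generally give a different number.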

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="p2"></a>

---
## **Part 2: Data Visualization Review**
---

**Run the cell below to load in the data.**

In [None]:
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vS9jPkeKJ8QUuAl-fFdg3nJPDP6vx1byvIBl4yW8UZZJ9QEscyALJp1eywKeAg7aAffwdKP63D9osF1/pub?gid=169291584&single=true&output=csv"
movie_df = pd.read_csv(url)

movie_df.drop_duplicates(inplace=True)

mean_runtime = movie_df['Runtime'].mean()
movie_df['Runtime'] = movie_df['Runtime'].fillna(mean_runtime)

movie_df = movie_df.rename(columns = {"Runtime": "Runtime (min)"})
movie_df = movie_df.astype({"Runtime (min)": "int64"})

movie_df.head()

<a name="p2.1"></a>

---
### **Part 2.1: Scatter Plots**
---

#### **Problem #2.1.1**

**Together**, let's create a scatterplot using `Runtime (min)` as the x-axis value and `Gross` as the y-axis value.

Make sure to include a meaningful:
* `Title`: "Gross Money vs. Runtime"
* `X-axis`: "Runtime (min)"
* `Y-axis`: "Gross (USD)"
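
For reference, here is a minimal sketch of a labeled scatter plot. The lists `hours_studied` and `test_score` are made-up values, not columns from `movie_df`.

```python
import matplotlib.pyplot as plt

# Hypothetical measurements, not the movie data.
hours_studied = [1, 2, 3, 4, 5]
test_score = [52, 60, 71, 78, 90]

# First argument is the x-axis data, second is the y-axis data.
points = plt.scatter(hours_studied, test_score)

# Title follows the "Y vs. X" convention; axis labels include units.
plt.title('Test Score vs. Hours Studied')
plt.xlabel('Hours Studied (hr)')
plt.ylabel('Test Score (points)')
plt.show()
```

With a DataFrame, the two lists would be replaced by columns such as `DF_NAME['x_variable']` and `DF_NAME['y_variable']`.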

---

#### **Now it's your turn! Try Problem #2.1.2 on your own.**

---

#### **Problem #2.1.2**

**Independently**, create a scatterplot using `Released_Year` as the x-axis value and `Runtime (min)` as the y-axis value.

Make sure to include a meaningful:
* `Title`: "Runtime vs. Released_Year"
* `X-axis`: "Year"
* `Y-axis`: "Runtime (min)"

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="p2.2"></a>

---
### **Part 2.2: Line Plots**
---

#### **Problem #2.2.1**

**Together**, let's create a line plot using `Runtime (min)` as the x-axis value and `Gross` as the y-axis value.

Make sure to include a meaningful:
* Title, ex: `'Gross Money vs. Runtime'`.
* X-axis label including units `'min'`.
* Y-axis label including units `'USD'`.

<br>

**NOTE**: This is not going to be a particularly helpful graph (the scatter plot is a better choice), but we often will not know this ahead of time. A lot of EDA and visualization involves trying a number of things and seeing what is useful.
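
One likely contributor to a messy line plot (an assumption here, since the lab does not state the cause) is that `plt.plot` connects points in the order the rows appear, so unsorted x values make the line zigzag. The sketch below sorts a made-up table by its x column before plotting. `toy_df` and its columns are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data with x values out of order.
toy_df = pd.DataFrame({
    'Duration (min)': [120, 95, 150, 110],
    'Revenue (USD)': [300, 150, 500, 200],
})

# Sort by the x column so the line moves left to right.
sorted_df = toy_df.sort_values('Duration (min)')

lines = plt.plot(sorted_df['Duration (min)'], sorted_df['Revenue (USD)'])
plt.title('Revenue vs. Duration')
plt.xlabel('Duration (min)')
plt.ylabel('Revenue (USD)')
plt.show()
```

Even after sorting, a line plot can stay jagged when many rows share similar x values, which is one reason a scatter plot is often the better first choice.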

---

#### **Now it's your turn! Try Problem #2.2.2 on your own.**

---

#### **Problem #2.2.2**

**Independently**, create a line plot using `Released_Year` as the x-axis value and `Average Gross in Year` as the y-axis value.

Make sure to include a meaningful:
* Title, ex: `'Average Gross Money vs. Released Year'`.
* X-axis label.
* Y-axis label including units `'USD'`.
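
As a reminder of how `groupby` produces one average per group, here is a minimal sketch on made-up data. The DataFrame `toy_df` and the columns `Season` and `Points` are hypothetical stand-ins.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, not the movie data.
toy_df = pd.DataFrame({
    'Season': [2001, 2001, 2002, 2002, 2002],
    'Points': [10, 20, 30, 60, 90],
})

# One mean per Season; the result is a Series indexed by Season.
mean_points = toy_df.groupby('Season')['Points'].mean()
print(mean_points)

# The index holds the group labels (x); the values hold the averages (y).
lines = plt.plot(mean_points.index, mean_points.values)
plt.title('Average Points vs. Season')
plt.xlabel('Season')
plt.ylabel('Average Points')
plt.show()
```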

In [None]:
mean_gross = movie_df.groupby(# COMPLETE THIS LINE


---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="p2.3"></a>

---
### **Part 2.3: Bar Plots**
---

#### **Problem #2.3.1**

**Together**, let's create a bar plot of the number of movies released per year.

Use the Series provided, `movies_per_year`, and make sure to include a meaningful:
* Title.
* X-axis label.
* Y-axis label.

In [None]:
movies_per_year = movie_df['Released_Year'].value_counts()

plt.bar(movies_per_year.index, # COMPLETE THIS CODE

---

#### **Now it's your turn! Try Problem #2.3.2 on your own.**

---

#### **Problem #2.3.2**

**Independently**, create a bar plot of the number of Dramas released per year.

Use the DataFrame provided, `movie_df`, and make sure to include a meaningful:
* Title.
* X-axis label.
* Y-axis label.

<br>

**Hint**: Recall that you can use `.loc[CRITERIA, :]` to find all data matching given criteria, and refer to the example in Problem #2.3.1 for finding the number of movies released per year.
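
The pattern in the hint, filtering with `.loc` and then counting rows per year, can be sketched on made-up data. The DataFrame `toy_df` and the columns `Year` and `Category` are hypothetical; the movie dataset's actual column names may differ.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, not the movie data.
toy_df = pd.DataFrame({
    'Year': [2020, 2021, 2021, 2022, 2020, 2021],
    'Category': ['Comedy', 'Drama', 'Comedy', 'Comedy', 'Comedy', 'Drama'],
})

# Keep only the rows that match the criteria (all columns).
comedies = toy_df.loc[toy_df['Category'] == 'Comedy', :]

# Count rows per year; sort_index() orders the result by year instead of by count.
comedies_per_year = comedies['Year'].value_counts().sort_index()

bars = plt.bar(comedies_per_year.index, comedies_per_year.values)
plt.title('Comedies per Year')
plt.xlabel('Year')
plt.ylabel('Number of Comedies')
plt.show()
```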

In [None]:
# COMPLETE THIS CODE

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="p3"></a>

---
## **Part 3: [OPTIONAL] Improving Visualizations**
---

In this section, we will explore several ways to improve upon the visuals we learned to make above.

<a name="p3.1"></a>

---
### **Part 3.1: Improving Scatter Plots**
---

#### **Problem #3.1.1**

We are given average temperature values for the months of the year for two cities: `city_A` and `city_B`.

**Independently**, plot each city's average temperatures. We'll need to make two scatter plots.

Make `city_A` markers blue and `city_B` markers red. Add labels and a legend.

From the graph, which city is most likely located in the Northeast?
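
Plotting two groups on the same axes takes one `plt.scatter` call per group, each with its own `color` and `label`, followed by `plt.legend()`. Below is a minimal sketch with made-up rainfall data; the names `days`, `site_X_rain`, and `site_Y_rain` are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical daily rainfall at two made-up sites.
days = [1, 2, 3, 4, 5]
site_X_rain = [0, 3, 1, 4, 2]
site_Y_rain = [5, 6, 4, 7, 8]

# One scatter call per group; label is the text shown in the legend.
plt.scatter(days, site_X_rain, color='purple', label='Site X')
plt.scatter(days, site_Y_rain, color='orange', label='Site Y')

plt.title('Daily Rainfall by Site')
plt.xlabel('Day')
plt.ylabel('Rainfall (mm)')
legend = plt.legend()  # builds the legend from the label arguments
plt.show()
```

Changing the marker colors only requires editing the `color` arguments.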

In [None]:
city_A_temps = [60,65,67, 70, 77, 84, 94, 101, 90, 82, 62]
city_B_temps = [-11, 14, 25, 32, 55, 73, 87, 92, 82, 66, 53]
months = np.arange(1,12)

# COMPLETE THE REST OF THE CODE

#### **Problem #3.1.2**

**Independently**, adjust the plot so that `city_A` markers are black and `city_B` markers are green.

In [None]:
city_A_temps = [60,65,67, 70, 77, 84, 94, 101, 90, 82, 62]
city_B_temps = [-11, 14, 25, 32, 55, 73, 87, 92, 82, 66, 53]
months = np.arange(1,12)

# COMPLETE THE REST OF THE CODE

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="p3.2"></a>

---
### **Part 3.2: Improving Line Plots**
---

#### **Problem #3.2.1**

**Independently**, create a line plot with the following features:


* A dashed line
* A grid
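
Both features come from standard matplotlib options: the `linestyle` argument of `plt.plot` and the `plt.grid` function. Here is a minimal sketch on made-up data; `week` and `steps` are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical weekly step counts.
week = [1, 2, 3, 4]
steps = [5000, 7000, 6500, 8000]

# Common linestyle strings: '-' solid, '--' dashed, '-.' dash-dot, ':' dotted.
lines = plt.plot(week, steps, linestyle='--')

# Draw grid lines behind the data.
plt.grid(True)

plt.title('Average Daily Steps by Week')
plt.xlabel('Week')
plt.ylabel('Steps')
plt.show()
```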




In [None]:
Year = [1920,1930,1940,1950,1960,1970,1980,1990,2000,2010]
Unemployment_Rate = [9.8,12,8,7.2,6.9,7,6.5,6.2,5.5,6.3]

# COMPLETE THE REST OF THE CODE

#### **Problem #3.2.2**

**Independently**, using the following data, create a line plot. In addition:
* Make that line dashed and dotted with `"-."`
* Add a grid to the background

In [None]:
# x axis values
x = [1,2,3]
# corresponding y axis values
y = [2,4,1]

# COMPLETE THE REST OF THE CODE

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="p3.3"></a>

---
### **Part 3.3: Improving Bar Plots**
---

#### **Problem #3.3.1**

**Independently**, make a bar plot with each bar as a different color.

In [None]:
langs = ['English', 'French', 'Spanish', 'Chinese', 'Arabic']
students = [23,17,35,29,12]

# COMPLETE THE REST OF THE CODE

#### **Problem #3.3.2**

**Independently**, use the following data to create a simple bar plot. Make all of the bars blue except bar `E`; make bar `E` red.
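
The `color` argument of `plt.bar` accepts either a single color or a list with one color per bar, which covers both "every bar different" and "highlight one bar". Here is a minimal sketch on made-up data; `labels`, `values`, and `colors` are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical categories and counts.
labels = ['P', 'Q', 'R']
values = [4, 9, 6]

# One color per bar, in the same order as the bars; the last bar is highlighted.
colors = ['gray', 'gray', 'orange']

bars = plt.bar(labels, values, color=colors)
plt.title('Counts by Category')
plt.xlabel('Category')
plt.ylabel('Count')
plt.show()
```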

In [None]:
height = [3, 12, 5, 18, 45]       # y
bars = ['A', 'B', 'C', 'D', 'E']  # x

# COMPLETE THE REST OF THE CODE

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="p3.4"></a>

---
### **Part 3.4: Enhancing Plot Aesthetics**
---

#### **Run the cell below to import the data for the following problems.**

This dataset contains information on U.S. agricultural exports in 2011.

In [None]:
url = 'https://raw.githubusercontent.com/plotly/datasets/master/2011_us_ag_exports.csv'
export_df = pd.read_csv(url)
export_df.head()

#### **Problem #3.4.1**

**Together**, let's compare beef exports across states using a bar plot. Adjust the size of the plot so that the graph and its labels are legible.



**NOTE**: To use a DataFrame for a graph, here is the syntax:
```
plt.bar(DF_NAME['x_variable'], DF_NAME['y_variable'])
```
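
One common way to make a crowded bar plot legible is to create a larger figure with `plt.figure(figsize=...)` before plotting and to rotate the x-axis tick labels. The sketch below uses made-up data; `toy_df` and the columns `Item` and `Amount` are hypothetical and do not come from `export_df`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, not the exports data.
toy_df = pd.DataFrame({
    'Item': ['Alpha', 'Bravo', 'Charlie', 'Delta'],
    'Amount': [12.5, 30.0, 7.25, 18.0],
})

# figsize is (width, height) in inches; call it before drawing the bars.
fig = plt.figure(figsize=(12, 4))

bars = plt.bar(toy_df['Item'], toy_df['Amount'])

# Rotate the tick labels so long names do not overlap.
plt.xticks(rotation=90)

plt.title('Amount by Item')
plt.xlabel('Item')
plt.ylabel('Amount')
plt.show()
```

The right `figsize` depends on how many bars there are, so expect to try a few values.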

---

#### **Now it's your turn! Try Problem #3.4.2 on your own.**

---

#### **Problem #3.4.2**

**Independently**, compare the export of corn from different states using a bar plot. Make sure you adjust the size of the plot.

---

# End of Notebook

© 2023 The Coding School, All rights reserved