# Thurs, 05 Aug - Python / Pandas Review

***To prepare for today, please either clone this notebook locally or create a blank notebook that you can work in!***

---

- Our Phase 1 project will ask you to demonstrate:
    - The workflow of the <a href='https://learning.flatironschool.com/courses/3567/pages/the-data-science-process?module_item_id=262598'>***Data Science Process***</a>
    - <a href='https://github.com/cwf231/StudyGroups/blob/main/StudyGroupNotes/Phase_1/1.99_data_visualizations.ipynb'>***Making compelling visualizations***</a> 
    - Creativity in *asking* and *answering* questions
    - Presentation skills


- It will also require you to demonstrate the skills you've been building in Python and Pandas.
    - Reading CSVs into Pandas
    - Merging DataFrames
    - Cleaning Data
    - Filtering / Grouping Data
    - Feature Engineering
    - Using Python visualization tools 
        - *(Matplotlib / Seaborn / Plotly)*
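
As a quick refresher on the first two skills in the list above, here is a minimal sketch of reading CSV data and merging two DataFrames. The column names and values are hypothetical, and `io.StringIO` stands in for real files so the snippet runs on its own; in the project you would pass a file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical CSV contents (made up for illustration).
# In practice: pd.read_csv("path/to/file.csv")
movies_csv = io.StringIO("movie_id,title\n1,Alpha\n2,Beta\n3,Gamma\n")
budgets_csv = io.StringIO("movie_id,budget\n1,100\n2,250\n")

movies = pd.read_csv(movies_csv)
budgets = pd.read_csv(budgets_csv)

# A left merge keeps every row of `movies`; movies without a matching
# budget row get NaN in the `budget` column.
merged = movies.merge(budgets, on="movie_id", how="left")
print(merged)
```

Choosing `how="left"` here is only one option; `inner`, `right`, and `outer` merges keep different sets of rows, so it is worth checking the row count before and after merging.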
        
        
*Below is a Checklist for you to keep on hand as you're starting to look at the project assignment (released tomorrow!).*

---

## Phase 1 Project Checklist
### Business Understanding: 
***Notebook clearly explains the project’s value for helping a specific stakeholder solve a real-world problem.***
- Introduction explains the real-world problem the project aims to solve
- Introduction identifies stakeholders who could use the project and how they would use it
- Conclusion summarizes implications of the project for the real-world problem and stakeholders 

### Data Understanding: 
***Notebook clearly describes the source and properties of the data to show how useful the data are for solving the problem of interest.***
- Describe the data sources and explain why the data are suitable for the project
- Present the size of the dataset and descriptive statistics for all features used in the analysis
- Justify the inclusion of features based on their properties and relevance for the project
- Identify any limitations of the data that have implications for the project
 
### Data Preparation: 
***Notebook shows how you prepare your data and explains why by including…***
- Instructions or code needed to get and prepare the raw data for analysis
- Code comments and text to explain what your data preparation code does
- Valid justifications for why the steps you took are appropriate for the problem you are solving

### Data Analysis: 
***Notebook promotes three recommendations for choosing films to produce.***
- Uses three or more findings from data analyses to support recommendations
- Explains why the findings support the recommendations
- Explains how the recommendations would help the new movie studio succeed

### Visualization: 
***Notebook includes three relevant and polished visualizations of findings that…***
- Help the project stakeholder understand the value or success of the project
- Have text and marks to aid reader interpretation, such as graph and axis titles, axis ticks and labels, or legend (varies by visualization type)
- Use color, size, and/or location to appropriately facilitate comparisons
- Are not cluttered, dense, or illegible
 
### Code Quality: 
***Code in notebook and related files meets professional standards (e.g. PEP 8)***
- Code is easy to read, using comments, spacing, variable names, and function docstrings
- All code runs and no code or comments are included that are not needed for the project 
- Code minimizes repetition, using loops, functions, and classes
- Code adapted from others is properly cited with author names and location of the cited material
 
### GitHub Repository: 
***Project repository demonstrates professional “best practices”:***
- README.md includes concise summary of project with all data science steps
- README.md links to presentation and sources
- README.md includes instructions for navigating the repository
- Files and folders are named briefly and descriptively, with consistent naming conventions
- Files and folders are organized logically and consistently
- Commit history includes regular commits with informative commit messages
- Large or sensitive files are listed in .gitignore and not pushed to GitHub
 
### Presentation Content: 
***Presentation clearly demonstrates the value of the project to stakeholders by…***
- Using plain language and clear visuals accessible to non-technical stakeholders
- Describing the project goals, data, methods, and results
- Explicitly connecting the descriptions of the project to stakeholder needs

### Slide Style: 
***Slides have a professional style, such that...***
- Slides use a professional template
- Slides are not cluttered
- Slides are light on text
- Slide text is easily readable
- Visuals are easy to understand
 
### Presentation Delivery: 
***Deliver your presentation clearly and engagingly by...***
- Speaking at a moderate volume and pace
- Describing your project simply and succinctly in about 5 minutes
- Using pauses, emphasis, or other variation in your speaking throughout the presentation
- Having a distinct introduction and conclusion
 
### Answers to Questions: 
***Answer questions clearly and appropriately by...***
- Directly addressing all aspects of the question that was asked 
- Responding accurately, succinctly, and in plain language
- Being sensitive to the knowledge level and interests of the asker
- Explaining any reasons why you cannot fully answer a question


# Practice!
### 0. Load Data - Together

<a href='https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html'>*Matplotlib Style Options*</a>

In [None]:
# Imports!
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
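# Choose your favorite! Note: on Matplotlib 3.6+ the seaborn styles are
# named 'seaborn-v0_8-<style>' (e.g. 'seaborn-v0_8-talk').
plt.style.use(['tableau-colorblind10', 'seaborn-talk'])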

In [None]:
# Load data - let's play with the mpg dataset.
df = sns.load_dataset('mpg')
df.head()

In [None]:
df.info()

### 1. Missing Values
- *Are there missing values? If so, how many and where?*

**Drop missing values**
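
One possible way to approach this step is sketched below. It uses a tiny hand-made DataFrame (the values are made up) rather than the real `mpg` data so that it runs on its own; the same calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the mpg data (values are made up for illustration).
toy = pd.DataFrame({
    "mpg": [18.0, 25.0, 31.0, 22.0],
    "horsepower": [130.0, np.nan, 65.0, np.nan],
    "weight": [3504, 2300, 1985, 2800],
})

# How many missing values are in each column?
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Where are they? Show the rows with at least one missing value.
print(toy[toy.isna().any(axis=1)])

# Drop those rows. dropna returns a new DataFrame by default.
toy_clean = toy.dropna()
print(toy_clean.shape)
```

Dropping rows is the simplest option; whether it is appropriate depends on how many rows are lost and why the values are missing.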

### 2. Create a visualization
***Show how `weight` impacts `mpg`.***

> (Hint: **continuous variable** compared to another **continuous variable**.)
> - Include a title and axis-labels.
> 
**Comment on what the visualization shows.**
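
A scatter plot is one common choice for comparing two continuous variables. The sketch below uses made-up toy values in place of the real data; the title and label text are just suggestions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the mpg data (values are made up for illustration).
toy = pd.DataFrame({
    "weight": [3504, 2300, 1985, 2800, 4300],
    "mpg": [18.0, 25.0, 31.0, 22.0, 14.0],
})

fig, ax = plt.subplots(figsize=(8, 5))

# One point per car: weight on the x-axis, mpg on the y-axis.
ax.scatter(toy["weight"], toy["mpg"], alpha=0.7)

# Title and axis labels help the reader interpret the plot.
ax.set_title("Fuel Efficiency vs. Vehicle Weight")
ax.set_xlabel("Weight")
ax.set_ylabel("Miles per gallon (mpg)")
```

In a notebook with `%matplotlib inline`, the figure is displayed automatically when the cell finishes running.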

### 3. Create a visualization

***Show how `origin` impacts `mpg`.***
> (Hint: **Categorical variable** compared to **continuous variable**.)
> 
> - *There are two simple ways to visualize this relationship. What are two ways to show the distribution of a population?*

**Comment on what the visualization shows.**
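
As one illustration, a box plot per category shows how the distribution of a continuous variable differs between groups. This sketch uses made-up toy values and plain Matplotlib; a Seaborn box plot, violin plot, or overlaid histograms would be reasonable alternatives.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the mpg data (values are made up for illustration).
toy = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "japan", "europe", "europe"],
    "mpg": [18.0, 15.0, 31.0, 33.0, 26.0, 24.0],
})

# Collect one array of mpg values per origin category.
groups = {name: grp["mpg"].to_numpy() for name, grp in toy.groupby("origin")}

fig, ax = plt.subplots(figsize=(8, 5))

# Draw one box per category, then label each box with its category name.
ax.boxplot(list(groups.values()))
ax.set_xticklabels(list(groups.keys()))

ax.set_title("Fuel Efficiency by Origin")
ax.set_xlabel("Origin")
ax.set_ylabel("Miles per gallon (mpg)")
```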

### 4. Filter the DataFrame.
***Create a new DataFrame with only cars made in the USA. Store it in a variable.***
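
A minimal sketch of boolean-mask filtering, using made-up toy rows. It assumes US-made cars are labeled `'usa'` in the `origin` column; checking `df['origin'].unique()` on the real data confirms the exact labels.

```python
import pandas as pd

# Toy stand-in for the mpg data (values are made up for illustration).
toy = pd.DataFrame({
    "name": ["car a", "car b", "car c", "car d"],
    "origin": ["usa", "japan", "usa", "europe"],
    "mpg": [18.0, 31.0, 15.0, 26.0],
})

# Build a boolean mask, then use it to select matching rows.
# .copy() gives an independent DataFrame, which avoids
# SettingWithCopyWarning if columns are added to it later.
is_usa = toy["origin"] == "usa"
usa_df = toy[is_usa].copy()
print(usa_df)
```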

### 5. Write a function in Python.

***The function should take a year-abbreviation (as in the column `model_year`) and return the decade the model was made in.***

> - *You decide how to format the output; however, the output should be* ***formatted as a string.***
> - *Test your function manually with a couple different `model_year` inputs.*
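
One possible implementation is sketched below. The function name `get_decade` and the `"1970s"` output format are choices made for this example, and it assumes every `model_year` value is a two-digit year from the 1900s.

```python
def get_decade(model_year):
    """Return the decade for a two-digit year abbreviation, e.g. 76 -> '1970s'.

    Assumes the year belongs to the 1900s.
    """
    # Integer division by 10 drops the last digit: 76 // 10 == 7.
    decade_start = 1900 + (int(model_year) // 10) * 10
    return f"{decade_start}s"


# Manual tests with a couple of inputs.
print(get_decade(70))  # 1970s
print(get_decade(82))  # 1980s
```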

### 6. Use the function!

***Use the function to create a new column for your DataFrame.***

*Call the new column `model_decade` and add it to the existing DataFrame.*
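
A minimal sketch of applying a function to a column with `Series.apply`. The helper `get_decade` is a hypothetical example function (repeated here so the snippet runs on its own), and the toy values are made up.

```python
import pandas as pd


def get_decade(model_year):
    """Hypothetical helper: 76 -> '1970s' (assumes years are in the 1900s)."""
    return f"{1900 + (int(model_year) // 10) * 10}s"


# Toy stand-in for the mpg data (values are made up for illustration).
toy = pd.DataFrame({"model_year": [70, 76, 80, 82]})

# apply calls the function once per value and returns a new Series,
# which is assigned to the DataFrame as a new column.
toy["model_decade"] = toy["model_year"].apply(get_decade)
print(toy)
```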

### 7. Aggregate your data.
***Show the `mean` and `count` of each numeric column for the different categories of `model_decade`.***
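
One way to compute several statistics per group is `groupby` followed by `agg`. The sketch below uses made-up toy values and restricts the aggregation to numeric columns, since a mean is not defined for text columns such as `origin`.

```python
import pandas as pd

# Toy stand-in for the mpg data (values are made up for illustration).
toy = pd.DataFrame({
    "model_decade": ["1970s", "1970s", "1980s", "1980s"],
    "mpg": [18.0, 22.0, 30.0, 34.0],
    "weight": [3500, 3000, 2200, 2000],
    "origin": ["usa", "usa", "japan", "europe"],
})

# Keep only numeric columns for the aggregation.
numeric_cols = list(toy.select_dtypes("number").columns)

# Passing a list of function names to agg yields one sub-column per
# statistic under each original column (a two-level column index).
summary = toy.groupby("model_decade")[numeric_cols].agg(["mean", "count"])
print(summary)
```

Individual values can be looked up with a tuple, for example `summary.loc["1970s", ("mpg", "mean")]`.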