<h1><center>Intro to Pandas for Data Analysts</center></h1>
<h3><center>(PART 1)</center></h3>

![Pandas](images/pandas.png)


## Agenda

- How to clone a repo from GitHub (follow along)
- Jupyter Interface (Hot Keys) 
- Pandas Package 
- Read in Data 
- Explore the Data 
- Summary Statistics 
- Data Visuals
- Variables (Features)
- Write out data to a CSV file

## Jupyter Hot Keys (most work in Colab too!) 

- `esc` + `a` (add a cell above) 
- `esc` + `b` (add a cell below) 
- `esc` + `d` + `d` (delete a cell)
- `esc` + `z` (undo)
- `esc` + `m` (markdown cell for text) 
- `esc` + `y` (code cell)
- `shift` + `return` (run a cell)

**Press `esc` + `h` or go to HELP in the navigation bar to view all shortcuts**

## What is pandas 

- Pandas is a Python **package** (or library) that has built-in **data manipulation** and **data exploration** functionality
- A **package** contains classes, objects, methods, and attributes (yes, that's a lot)
- **Packages** help us **NOT** reinvent the wheel: we reuse code that has already been packaged up nicely for us :)

- Pandas has **two** main **objects** -- `DataFrames` and `Series` 
- A DataFrame is a 2D table (it has rows and columns, and each column can hold a different data type)
- A Series is a single column (one dimension)

*Special Note: Pandas is built on top of NumPy*

*Always view package documentation: [Documentation Here](https://pandas.pydata.org/docs/reference/frame.html)*
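
To make the DataFrame/Series distinction concrete, here is a minimal sketch on a tiny made-up table (the names and values are invented for illustration and are not part of the students data). Selecting one column from a DataFrame gives back a Series.

```python
import pandas as pd

# Toy data invented for illustration; not the students dataset
toy = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "GPA": [3.2, 3.8, 2.9],
})

gpa = toy["GPA"]  # selecting a single column returns a Series

print(type(toy).__name__)  # DataFrame
print(type(gpa).__name__)  # Series
print(toy.ndim)            # 2 (rows and columns)
print(gpa.ndim)            # 1 (a single column)
```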

## What is an object

- EVERYTHING in Python (variables, models, packages, etc.) is an **object**
- **Objects** are essential in object-oriented programming
- Objects are instances of **classes** and have specific **methods** and **attributes**
- **Methods** are like actions 
- **Attributes** are like properties or characteristics 

- As noted above, we will be using **methods** and **attributes** that apply to **DataFrame** or **Series** objects


![Basketball-OOP](images/oop.png)
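
The method/attribute split can be sketched with a small Series of made-up scores (the values and the name `math` are assumptions for illustration only). Attributes are read without parentheses; methods are called with them.

```python
import pandas as pd

scores = pd.Series([70, 85, 90], name="math")  # made-up values

# Attributes: properties of the object, accessed WITHOUT parentheses
print(scores.name)  # math
print(scores.size)  # 3

# Methods: actions on the object, called WITH parentheses
print(scores.max())  # 90
print(scores.mean())
```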

## The DATA 

- The dataset is the `students.csv` file in the data folder, taken from [Kaggle](https://www.kaggle.com/datasets/erqizhou/students-data-analysis?resource=download)
- It is a fictional dataset; the goal is to analyze the data to see what may affect a student's probability of applying to graduate school
- Race is a censored feature (it may still be useful)
- Assume that a higher score in a math subject means better academic performance in that subject
- The `form1-form4` columns are censored and represent a student's background (the data dictionary can be found at the Kaggle link above)

- The target variable, `y`, takes three values: 0 (failed to apply), 1 (applied within country), 2 (applied abroad)

### Step 1:  Import pandas package 

In [None]:
#pd is considered an alias so we don't have to type pandas each time

import pandas as pd 

### Step 2:  Read in a `.csv` file as a Pandas DataFrame Object 

In [None]:
students = pd.read_csv('data/students.csv')
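
If you don't have the file on disk yet, you can still see how `pd.read_csv` behaves. The sketch below reads CSV text from memory using `io.StringIO`; the three rows are hypothetical stand-ins, not rows from `students.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """name,GPA,y
Ana,3.2,0
Ben,3.8,2
Cy,2.9,1
"""

# read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
```

The first line of the text becomes the column names, and pandas infers a data type for each column.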

### Step 3: Explore the data 

- We will use **METHODS** first

*Recall: Methods are called with a set of parentheses and perform some ACTION*

First we will look at the **top** and **bottom** of the DataFrame

In [None]:
students.head()

In [None]:
students.tail()
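
Both methods show 5 rows by default and accept a number if you want more or fewer. A small sketch on a made-up 10-row table (the `row_id` column is invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"row_id": range(1, 11)})  # 10 made-up rows

first_three = toy.head(3)   # first 3 rows
last_two = toy.tail(2)      # last 2 rows
default_head = toy.head()   # no argument: first 5 rows

print(first_three)
print(last_two)
```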

Next let's explore some INFORMATION about the data using the `.info()` method.  Can you think of how we would code this?

In [None]:
#Use .info() to explore the data 



What does the info method tell you?
Are there any missing values?
Do the data types make sense?


Write some markdown in this block
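
One way to double-check the missing-value question is to count missing entries per column directly. This is a sketch on a toy table with a few gaps planted on purpose; the columns mimic the students data but the values are invented.

```python
import pandas as pd

# Toy data with deliberately missing entries (None)
toy = pd.DataFrame({
    "gender": ["F", "M", None, "F"],
    "GPA": [3.2, None, 2.9, 3.5],
    "y": [0, 1, 2, 1],
})

toy.info()                  # prints column names, non-null counts, and dtypes
missing = toy.isna().sum()  # number of missing values in each column
print(missing)
```

A non-null count lower than the number of rows in the `.info()` output means that column has missing values.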

Now let's look at some **ATTRIBUTES** (properties)

*Recall: Attributes are accessed without parentheses and are characteristics of an object. The object here is still a DataFrame*

We will start by using the `.shape` attribute.

In [None]:
students.shape

#What is the output telling you? 

Now, look at the `.columns` attribute.  What is the output?

In [None]:
#Use .columns on the students data 
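
To show what these two attributes return, here is a sketch on a tiny made-up table (columns `a` and `b` are placeholders). `.shape` is a (rows, columns) tuple, and `.columns` holds the column labels.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})  # made-up data

n_rows, n_cols = toy.shape         # shape is a tuple: (rows, columns)
col_names = toy.columns.tolist()   # column labels as a plain Python list

print(n_rows, n_cols)  # 3 2
print(col_names)       # ['a', 'b']
```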


### Step 4:  Use Statistical Summaries to further explore the data

- Summary statistics usually include the count, min, max, standard deviation, mean, and more 
- It is important to look at summary statistics across many subsets of your data (stats by group, gender, location, etc.) 

We will start by using the `.describe()` method to obtain summary stats for numeric features

In [None]:
#T means transpose -- flip the rows and columns 

students.describe().T

#Anything stand out? Add a markdown cell and write out your main takeaways 
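
By default `.describe()` only summarizes numeric columns. The sketch below, on invented data, shows that behavior and the `include="all"` option, which also summarizes text columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "GPA": [2.0, 3.0, 3.0, 4.0],
})  # made-up values

numeric_summary = toy.describe().T           # numeric columns only
all_summary = toy.describe(include="all").T  # text columns are included too

print(numeric_summary)
print(all_summary)
```

For text columns, the extra rows report things like the number of unique values and the most frequent value instead of a mean.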

The `pd.crosstab()` function can be used to create frequency tables for categorical variables. You can select a specific column using this syntax:

**dataframe['ColumnName']**

Let's create a crosstab of `gender` and `class`

In [None]:
#Creates a crosstab of gender and class 

pd.crosstab(students["gender"], students["class"])
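
Two optional arguments of `pd.crosstab` are often handy: `margins=True` adds row and column totals, and `normalize="index"` turns counts into row proportions. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F"],
    "class":  ["A", "B", "A", "A", "A"],
})  # invented rows for illustration

# Counts with an extra "All" row and column holding the totals
counts = pd.crosstab(toy["gender"], toy["class"], margins=True)

# Share of each class within each gender (every row sums to 1)
row_share = pd.crosstab(toy["gender"], toy["class"], normalize="index")

print(counts)
print(row_share)
```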

The `.groupby()` method lets you compute summary statistics for each level of a categorical variable. Combining it with an aggregation method (e.g. `.mean()`, `.max()`, `.sum()`) gives you that summary across all numeric features

Let's group by `class`

In [None]:
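#numeric_only=True averages only the numeric columns (text columns cannot be averaged)
students.groupby(by='class').mean(numeric_only=True)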
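
If you only care about one feature, you can select it after grouping, and `.agg()` lets you ask for several statistics at once. In recent pandas versions, averaging a grouped table that still contains text columns raises an error, so picking the column first also sidesteps that. The data below is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "gender": ["F", "M", "F", "M"],
    "GPA": [3.0, 4.0, 2.0, 3.0],
})  # invented values

# Mean GPA for each class
mean_by_class = toy.groupby("class")["GPA"].mean()

# Several statistics for GPA in one table
stats_by_class = toy.groupby("class")["GPA"].agg(["mean", "max", "count"])

print(mean_by_class)
print(stats_by_class)
```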

The `.value_counts()` method gives you a quick count of each level **WITHIN** a categorical variable.

Let's look at the target variable, `y`

In [None]:
students['y'].value_counts()

#What does this tell us? 
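
Raw counts are easier to compare as proportions. Passing `normalize=True` does that conversion; the sketch below uses six made-up target values rather than the real `y` column.

```python
import pandas as pd

y = pd.Series([0, 1, 1, 2, 1, 0], name="y")  # made-up target values

counts = y.value_counts()                # sorted from most to least frequent
shares = y.value_counts(normalize=True)  # proportions that sum to 1

print(counts)
print(shares)
```

If one level dominates the proportions, the target is imbalanced, which is worth knowing before any modeling.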

### Step 5: Visualize the data 

- Matplotlib is a foundational data viz package 
- A higher-level package is Seaborn, which is built on top of Matplotlib
- You can also plot using **pandas** 
- Visualizing data helps us tell **stories**, and **data storytelling** is very important

Let's use a **bar chart** to visualize the `class` and `gender` crosstab we created above

In [None]:
#Importing matplotlib and seaborn for the later plots; first we will try plotting with pandas

import matplotlib.pyplot as plt 
import seaborn as sns

In [None]:
#Bar Chart Made in Pandas 

pd.crosstab(students['class'], students['gender']).plot(kind="bar")

Let's create a boxplot of the `GPA` feature using seaborn

In [None]:
sns.boxplot(x=students["GPA"])

#Do we have any outliers?
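
The points a boxplot draws beyond the whiskers can also be found numerically. This sketch applies the common 1.5 x IQR rule, which is the default whisker length in matplotlib and seaborn boxplots, to a made-up GPA column with one suspicious value planted in it.

```python
import pandas as pd

# Made-up GPA values; 0.5 is planted as an obvious outlier
gpa = pd.Series([2.8, 3.0, 3.1, 3.2, 3.3, 3.4, 0.5], name="GPA")

q1 = gpa.quantile(0.25)  # first quartile
q3 = gpa.quantile(0.75)  # third quartile
iqr = q3 - q1            # interquartile range

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values that fall outside the two fences
outliers = gpa[(gpa < lower) | (gpa > upper)]
print(outliers)
```

Whether a flagged value is an error or a genuine extreme case is a judgment call that depends on the data.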

And a histogram of the `GPA` variable using seaborn 

In [None]:
sns.histplot(students["GPA"])

What about a scatterplot of `Calculus1` against `Calculus2` using matplotlib?

In [None]:
fig, ax = plt.subplots()

ax.scatter(students["Calculus1"], students["Calculus2"])
plt.show()
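
A scatterplot is easier to read with axis labels, and the `.corr()` method puts a number on the pattern you see. The sketch below uses four invented score pairs; the `matplotlib.use("Agg")` line is only there so the code runs outside a notebook and can be dropped in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame({
    "Calculus1": [60, 70, 80, 90],
    "Calculus2": [65, 72, 85, 88],
})  # made-up scores

fig, ax = plt.subplots()
ax.scatter(toy["Calculus1"], toy["Calculus2"])
ax.set_xlabel("Calculus1")
ax.set_ylabel("Calculus2")
ax.set_title("Calculus1 vs. Calculus2 (toy data)")

# Pearson correlation between the two columns (between -1 and 1)
r = toy["Calculus1"].corr(toy["Calculus2"])
print(round(r, 2))
```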

### Step 6:  Creating Variables (Optional) 

- Let's create an average calculus performance variable for our data called `Avg_Calc`

In [None]:

students['Avg_Calc'] = (students['Calculus1'] + students['Calculus2'])/2

#Did it work?

In [None]:
#Check the head of the data 
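
An alternative way to build the same kind of variable is to select the columns and take a row-wise mean with `.mean(axis=1)`, which scales better when there are many columns to average. This is a sketch on two made-up rows, not a replacement for the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Calculus1": [60, 80],
    "Calculus2": [70, 90],
})  # made-up scores

# axis=1 averages across the columns within each row
toy["Avg_Calc"] = toy[["Calculus1", "Calculus2"]].mean(axis=1)

print(toy.head())
```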


### Step 7:  Dropping Variables (Optional) 

- Our client doesn't want to consider the `Functional_analysis` course for this project -- let's drop it using the `.drop()` method.

- If we are dropping a **row** we use `axis = 0` as an argument in the `.drop()` method
- If we are dropping a **column** we use `axis = 1` as an argument in the `.drop()` method

In [None]:
students = students.drop("Functional_analysis", axis=1)

#Did it work?

In [None]:
#Let's check the head 
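
If the `axis` numbers are hard to remember, `.drop()` also accepts the keyword arguments `columns=` and `index=`, which say the same thing by name. The sketch below uses placeholder columns `a`, `b`, `c`; note that `.drop()` returns a new DataFrame and leaves the original untouched unless you reassign it.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})  # made-up data

no_b = toy.drop(columns="b")     # same as toy.drop("b", axis=1)
no_first_row = toy.drop(index=0) # same as toy.drop(0, axis=0)

print(no_b)
print(no_first_row)
print(toy.shape)  # (2, 3): the original is unchanged
```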



### Step 8:  Write out your new dataframe to a csv

- This saves you from rerunning all the cells each time you want to work with your changed data
- Always save it as a **NEW FILE** so you don't overwrite the original


In [None]:
students.to_csv("new_students_04102023.csv")
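
By default `.to_csv()` also writes the row index as an extra unnamed column, which then shows up when you read the file back. Passing `index=False` skips it. The sketch below writes a toy table to a temporary folder (the file name `toy_students.csv` is made up) and reads it back to confirm the columns match.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({"name": ["Ana", "Ben"], "GPA": [3.2, 3.8]})  # made-up data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_students.csv")  # hypothetical file name
    toy.to_csv(path, index=False)                  # do not write the row index
    reloaded = pd.read_csv(path)                   # read the file back in

print(reloaded)
```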

## Summary 

- Pandas is a great package for data exploration and manipulation 
- Everything in Python is an OBJECT -- objects have specific methods and attributes 
- Statistical summaries are great for exploring your data
- Visuals help you explore your data further and create a clearer picture
- Variables can easily be created or dropped using Pandas
- Spend plenty of time learning what is in and NOT IN your data :)

<h1><center>The End</center></h1>
<h2><center>@LearningwithJelly</center></h2>

In [None]:
from traitlets.config.manager import BaseJSONConfigManager
from pathlib import Path
path = Path.home() / ".jupyter" / "nbconfig"
cm = BaseJSONConfigManager(config_dir=str(path))
cm.update(
    "rise",
    {
        "theme": "white",
        "transition": "fade",
        "start_slideshow_at": "selected",
        "footer": "  <h6>Learning with Jelly</h6>",
        "header": "  <h3>Intro to Pandas - Part 1</h3>",
        "width":  "90%",
        "height": "110%",
        "enable_chalkboard": True
     }
)