# Pre Data Challenge Tutorial
***

#### A Beginner's Guide to Analyzing Data in Python

###### This tutorial demonstrates the differences between series and data frames, how to clean data in data frames, and how to make basic visualizations in PyCharm with matplotlib.

## Fill out this form when you come in!
## https://forms.gle/H7Euza2TSyUNtAZd9
***

In [52]:
import pandas as pd

#### Making a Series

In [53]:
bananas = pd.Series([0, 3, 4, 6, 7])
bananas.name = "Bananas"

bananas

#### Making a Data Frame

In [54]:
df_data = {
    "apples": [3, 2, 0, 1],
    "oranges": [2, 4, 1, 0]
}

purchases = pd.DataFrame(df_data)

purchases

#### Re-Indexing a Data Frame

In [55]:
purchases = pd.DataFrame(df_data, index=['June', 'Robert', 'Lily', 'David'])

purchases
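To tie the two structures together, here is a minimal sketch (reusing the toy purchases data from the cells above) showing that pulling a single column out of a data frame yields a Series that carries the same row labels.

```python
import pandas as pd

# Same toy data as the purchases example
purchases = pd.DataFrame(
    {"apples": [3, 2, 0, 1], "oranges": [2, 4, 1, 0]},
    index=["June", "Robert", "Lily", "David"],
)

# Selecting one column gives back a Series that keeps the row labels
apples = purchases["apples"]

print(type(purchases).__name__)               # DataFrame
print(type(apples).__name__)                  # Series
print(apples.index.equals(purchases.index))   # True
print(apples["Lily"])                         # 0
```

In other words, a data frame can be thought of as a set of Series that share one index.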

#### Importing a CSV File into a Data Frame

###### Passing index_col=0 to read_csv makes the first column of the dataset the index instead of the default 0-based integer index. Here we keep the default index; index_col=0 is used later for label-based selection.

In [56]:
cereal = pd.read_csv("cereal.csv")
cereal.head(10)

In [57]:
cereal.tail(5)
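As a rough illustration of what `index_col=0` changes, the sketch below reads the same CSV text twice. The three rows are made-up stand-ins, not values from cereal.csv, and `io.StringIO` is used only so the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for a CSV file (made-up values)
csv_text = "name,calories\nAlpha Flakes,110\nBeta Bran,90\nGamma Puffs,100\n"

# Default: pandas adds a 0-based integer index
default_index = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column becomes the index
name_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_index.index))  # [0, 1, 2]
print(list(name_index.index))     # ['Alpha Flakes', 'Beta Bran', 'Gamma Puffs']
print(default_index.shape)        # (3, 2)
print(name_index.shape)           # (3, 1)
```

Note that the indexed version has one fewer column, because the name column has moved into the index.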

#### Info, Describe, and Shape

In [58]:
cereal.info()

cereal.describe()

In [59]:
cereal.shape

In [60]:
cereal.columns

In [61]:
# Column names are case sensitive, so printing the columns first shows exactly what to rename
cereal.rename(columns={
        'Calories': 'Cals', 
        'Cups per Serving': 'Cups',
        'Potassium': 'K'
    }, inplace=True)


cereal.columns
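The case-sensitivity point is easy to miss because, by default, `rename` does not complain about keys that match nothing. A small sketch with a made-up two-column frame:

```python
import pandas as pd

# Made-up frame for illustration
df = pd.DataFrame({"Calories": [110, 90], "Potassium": [95, 280]})

# 'calories' (lowercase) does not match 'Calories', so that key is skipped
# without raising an error; only 'Potassium' is renamed
renamed = df.rename(columns={"calories": "Cals", "Potassium": "K"})

print(list(renamed.columns))  # ['Calories', 'K']

# Without inplace=True, the original frame is left unchanged
print(list(df.columns))       # ['Calories', 'Potassium']
```

If a rename seems to have had no effect, checking the exact spelling and case of the column names is a good first step.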


#### Editing Multiple Columns at Once

In [62]:
cereal.columns = [col.upper() for col in cereal.columns]

cereal.columns

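#### Stacking Data Frames with concat() and Handling Duplicates

In [63]:
# DataFrame.append() was removed in pandas 2.0; pd.concat() stacks the frames instead
temp_df = pd.concat([cereal, cereal])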

temp_df.shape

In [64]:
temp2_df = temp_df.drop_duplicates()

temp2_df.shape
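Here is a small sketch of the same stack-then-deduplicate idea on a made-up two-row frame, using `pd.concat`, which works in current pandas versions. It also shows that stacked rows keep their original index labels unless `ignore_index=True` is passed.

```python
import pandas as pd

# Made-up frame for illustration
df = pd.DataFrame({"name": ["A", "B"], "cals": [110, 90]})

# Stack the frame on top of itself: every row now appears twice
doubled = pd.concat([df, df])
print(doubled.shape)          # (4, 2)
print(list(doubled.index))    # [0, 1, 0, 1]  (labels are repeated)

# drop_duplicates keeps the first copy of each repeated row
deduped = doubled.drop_duplicates()
print(deduped.shape)          # (2, 2)

# ignore_index=True renumbers the stacked rows from 0
renumbered = pd.concat([df, df], ignore_index=True)
print(list(renumbered.index)) # [0, 1, 2, 3]
```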

#### Dealing with Null Values

##### You have two options: 
1. delete rows or columns with null entries
2. replace nulls with non-null values (imputation)  

##### First you want to check for nulls:

In [65]:
cereal.isnull().sum()

##### There are 6 missing values in the RATING column!

###### Removing null data is only suggested if a small portion of the data is null. Since there are only 77 rows, we probably want to keep that data.
###### However, to remove the rows with nulls, use this command:

In [66]:
cereal_remove_rows = cereal.dropna()
    
cereal_remove_rows.isnull().sum()


###### To remove the columns with nulls, use this command:

In [67]:
cereal_remove_cols = cereal.dropna(axis=1)

cereal_remove_cols.isnull().sum()
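To see how much data each deletion option costs, the sketch below applies both to a made-up three-row frame with a single missing rating.

```python
import pandas as pd

# Made-up frame with one missing rating
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "cals": [110, 90, 100],
    "rating": [68.4, float("nan"), 33.9],
})

# Default (axis=0): drop every row that contains a null
rows_dropped = df.dropna()
print(rows_dropped.shape)           # (2, 3)

# axis=1: drop every column that contains a null
cols_dropped = df.dropna(axis=1)
print(cols_dropped.shape)           # (3, 2)
print(list(cols_dropped.columns))   # ['name', 'cals']
```

One null removes a whole row in the first case and a whole column in the second, which is why deletion is usually reserved for data with few nulls.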

##### We can impute nulls with another value, usually the mean or the median of that column.
###### First, we need to get the median or mean of the column with the null value(s)
###### This begins with grabbing the column as a Series

In [68]:
ratings = cereal['RATING']
ratings.head()

In [69]:
ratings_mean = ratings.mean()
ratings_mean

##### We can see the nulls in the cereal data frame below for reference

In [70]:
cereal.isnull().sum()

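###### Assigning the filled column back to cereal stores the result in the original data frame

In [71]:
# Filling the extracted Series in place may not update cereal (e.g. under copy-on-write), so assign back
cereal['RATING'] = cereal['RATING'].fillna(ratings_mean)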

cereal.isnull().sum()
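The imputation steps can be sketched end to end on a made-up column. Assigning the filled column back to the data frame is one pattern that does not depend on whether the extracted Series is a view or a copy of the original data.

```python
import pandas as pd

# Made-up ratings with one missing value
df = pd.DataFrame({"rating": [60.0, float("nan"), 40.0]})

# mean() skips nulls by default: (60 + 40) / 2 = 50.0
mean_rating = df["rating"].mean()
print(mean_rating)                       # 50.0

# Fill the null and store the result back in the data frame
df["rating"] = df["rating"].fillna(mean_rating)

print(df["rating"].tolist())             # [60.0, 50.0, 40.0]
print(int(df["rating"].isnull().sum()))  # 0
```

Swapping `mean()` for `median()` gives median imputation, which is less sensitive to extreme values.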

#### Understanding Your Variables
##### This is an important step before finding insights and creating visualizations

In [72]:
names = cereal['NAME']
names.describe()

In [73]:
names.value_counts()

#### Finding Correlations in Your Data

##### We can recall from statistics that correlation coefficients range from -1 to 1

##### -1 means a perfect negative correlation, 0 means no linear correlation, and 1 means a perfect positive correlation

###### Looking at the table below, we can find the intersection of K and Fiber. There is a 0.9 correlation there! Interesting.

In [74]:
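# numeric_only=True skips text columns such as NAME, which would raise an error in pandas 2.0+
cereal.corr(numeric_only=True)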
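To make the two ends of the scale concrete, here is a sketch on made-up data where one column rises exactly with another and a third falls exactly as it rises. It assumes pandas 1.5 or newer, where `corr` accepts `numeric_only`.

```python
import pandas as pd

# Made-up values chosen so the relationships are exactly linear
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "fiber": [1.0, 2.0, 3.0, 4.0],
    "k": [10.0, 20.0, 30.0, 40.0],    # rises exactly with fiber
    "rating": [8.0, 6.0, 4.0, 2.0],   # falls exactly as fiber rises
})

# numeric_only=True leaves the text column out of the calculation
corr = df.corr(numeric_only=True)

print(corr.loc["fiber", "k"])        # approximately 1.0
print(corr.loc["fiber", "rating"])   # approximately -1.0
print("name" in corr.columns)        # False
```

Real data rarely hits exactly 1 or -1, so values near either end are the ones worth a closer look.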

#### Selecting, Slicing, and Extracting

###### Extracting a Column as a Series

In [75]:
cals = cereal['CALS']
type(cals)

###### Extracting a Column as a Data Frame

In [76]:
cals_df = cereal[['CALS']]
type(cals_df)

###### Creating a Subset of a Data Frame by Column

In [77]:
nutrients = cereal[['CALS', 'PROTEIN', 'FAT', 'SODIUM', 'FIBER', 'CARBS', 'SUGARS', 'K', 'VITAMINS']]
nutrients


In [78]:
nutrients2 = cereal.drop(['NAME', 'WEIGHT','RATING', 'CUPS'], axis=1)
nutrients2.columns

#### Extracting Rows from a Data Frame
##### .loc - locates by name  
##### .iloc - locates by integer position

In [79]:
cereal.iloc[0]

##### To make use of .loc, we need to index cereal by Name

In [80]:
cereal_by_name = pd.read_csv("cereal.csv", index_col=0)
cereal_by_name

In [81]:
cereal_by_name.loc['Cheerios']

##### You can also grab a range of rows: .iloc uses a half-open range [start, end), while .loc uses a closed range [start, end]
###### Notice how .iloc does not include position 20, but .loc does include Count Chocula

In [82]:
cereal.iloc[10:20]

In [83]:
cereal_by_name.loc['Cheerios': 'Count Chocula']
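The difference in end points is easier to see on a tiny frame. The sketch below uses made-up labels A through D.

```python
import pandas as pd

# Made-up frame indexed by label
df = pd.DataFrame({"cals": [110, 90, 100, 120]}, index=["A", "B", "C", "D"])

# .iloc slices by position and excludes the end position
by_position = df.iloc[1:3]
print(list(by_position.index))  # ['B', 'C']

# .loc slices by label and includes the end label
by_label = df.loc["B":"D"]
print(list(by_label.index))     # ['B', 'C', 'D']
```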

#### Conditional Selections

In [84]:
good_ratings = (cereal['RATING'] >= 50)
good_ratings

##### Looking at Trues and Falses for each row isn't the most helpful, so we should subset the trues instead.

In [85]:
cereal[cereal['RATING'] >= 50]

###### Use .count() to get the number of rows in a selection
###### Try removing ['NAME'] to see why it was added

In [86]:
cereal[cereal['CALS'] <= 100]['NAME'].count()
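As a hint for the exercise above, this sketch on made-up data shows what `.count()` returns with and without picking a single column first. Note that `.count()` only counts non-null entries.

```python
import pandas as pd

# Made-up frame with one missing rating
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "cals": [110, 90, 100],
    "rating": [68.4, float("nan"), 33.9],
})

low_cal = df[df["cals"] <= 100]   # keeps rows B and C

# On a data frame, count() returns one number per column (nulls excluded)
print(low_cal.count().to_dict())  # {'name': 2, 'cals': 2, 'rating': 1}

# On a single column, count() returns one number
print(low_cal["name"].count())    # 2

# len() counts rows regardless of nulls
print(len(low_cal))               # 2
```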

##### Try some other selections from cereal using the operators |, &, <, >, ==:
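As a starting point, here is a sketch on made-up data that combines two conditions. In pandas, element-wise "and" is written `&` and "or" is written `|`, and each condition needs its own parentheses because these operators bind more tightly than the comparisons.

```python
import pandas as pd

# Made-up frame for illustration
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "cals": [110, 90, 100, 120],
    "rating": [68.4, 55.0, 33.9, 40.0],
})

# Both conditions must hold: low calories AND a good rating
both = df[(df["cals"] <= 100) & (df["rating"] >= 50)]
print(list(both["name"]))    # ['B']

# At least one condition must hold: low calories OR a good rating
either = df[(df["cals"] <= 100) | (df["rating"] >= 50)]
print(list(either["name"]))  # ['A', 'B', 'C']
```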

***
# Matplotlib
#### We'll go through a brief overview of plotting in Python, though there are other tools you may want to use during the Data Challenge

##### Using ! in front of a command lets it run as if it were typed in a terminal

In [87]:
!pip install matplotlib

In [88]:
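# plt is the conventional alias for the pyplot module, not for the top-level matplotlib package
import matplotlib.pyplot as plt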

# set font and plot size to be larger
plt.rcParams.update({'font.size': 20, 'figure.figsize': (10, 8)}) 

#### Why it was important to understand your variables earlier:
##### - For categorical variables utilize Bar Charts and Boxplots.
##### - For continuous variables utilize Histograms, Scatterplots, Line graphs, and Boxplots.

###### Try running the below code with and without a semicolon at the end

In [89]:
cereal.plot(kind='scatter', x='K', y='FIBER', title="Potassium x Fiber");

In [90]:
cereal['RATING'].plot(kind='hist', title='Ratings');

In [91]:
cereal['RATING'].plot(kind="box");
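The plots above all use continuous variables. For a categorical variable, one common approach is to count the categories and draw a bar chart. The sketch below uses a made-up `shelf` column, which is a hypothetical example and not a column known to be in cereal.csv.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical categorical column (made-up values)
df = pd.DataFrame({"shelf": ["top", "middle", "middle", "bottom", "middle"]})

# Count how often each category appears
counts = df["shelf"].value_counts()
print(counts["middle"])  # 3

# Draw one bar per category on an explicit Axes object
fig, ax = plt.subplots()
counts.plot(kind="bar", ax=ax, title="Cereals per Shelf")
ax.set_ylabel("Count")
```

Passing an explicit `ax` makes it easy to adjust labels afterwards or to place several plots in one figure.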

***
# Thank You For Coming!
### We hope you learned some stuff(?)

## Please fill out our survey: https://forms.gle/7YHY8EsnSkxAk4AS9 