# Introduction Tutorial
### Welcome to a Jupyter Notebook!

This is what a Jupyter notebook looks like! This cell is a **markdown** cell. You can edit text or insert images with very simple formatting.

### Before we continue... Some useful Jupyter shortcuts

* shift + enter -> run cell
* esc + m -> make a cell markdown
* esc + a -> insert cell above
* esc + b -> insert cell below
* esc + d + d (double press d key) -> delete cell

(or just use the buttons in the header)

---
# Quick review of Python

We will have a quick review of Python syntax before moving onto Pandas.

1. Data types
2. Conditional statements
3. Loops

In [None]:
# Data types

myInteger = 1
myFloat = 1.2
myString = "I am a string!"
myList = [1,2,3,100,"a random string"]
myDict = {"name": "panda", "favorite_food": "pizza"}
myBool = True

print(myString)

One thing to remember is that we can freely index strings and lists with the bracket notation, like:

In [None]:
#printing second element only of [1,2,3,100,"a random string"]
myList[1]

In [None]:
# printing from the third character to the end of "I am a string!"
myString[2:]
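
As a small sketch of the bracket notation described above, the cell below uses copies of the string and list from the earlier cell (the names `my_string` and `my_list` are only illustrative). It shows single-item indexing, negative indices, and slices, where the stop position is excluded.

```python
# Indexing and slicing work the same way on strings and lists.
my_string = "I am a string!"
my_list = [1, 2, 3, 100, "a random string"]

first_char = my_string[0]      # index 0 is the first character
last_item = my_list[-1]        # negative indices count from the end
from_third = my_string[2:]     # third character to the end
first_three = my_list[:3]      # the stop position (3) is excluded

print(first_char, last_item, from_third, first_three)
```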

### 2. Conditional Statements

Unlike many other languages, Python doesn't need parentheses around `if`/`else` conditions. We also don't need curly brackets for the block of code to execute; the block is marked by a colon and indentation.

In [None]:
if myBool:
    print("myBool value is True")

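To check whether a variable is `None`, we can simply write `if not some_var` instead of `if some_var is None`. Keep in mind that `if not some_var` is also true for other "empty" values such as `0`, `""`, and `[]`, so use `is None` when that distinction matters, and prefer it over `if some_var == None`. A name that was never assigned raises a `NameError` instead.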

In [None]:
myName = None

if not myName:
    myName = "Al Flower"
    print("my Name is", myName)
    
else:
    print(myName)
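
One caveat is worth a quick sketch: `if not some_var` treats every "empty" value as false, not just `None`. The helper `describe` below is a hypothetical function written only for this illustration; it reports how each of the two checks classifies a value.

```python
def describe(value):
    """Return how the two checks classify a value."""
    return {"is_none": value is None, "is_falsy": not value}

# None, 0, "" and [] are all falsy, but only None passes the `is None` check
for candidate in [None, 0, "", [], "Al Flower"]:
    print(repr(candidate), describe(candidate))
```

If `0` or an empty string is a legitimate value in your data, `is None` is the safer check.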

Checking whether a list contains a certain element, or whether a string contains a certain substring, is also very easy. You can just check `if a in b`.

In [None]:
if 1 in myList:
    print("1 is in myList")
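
A few more membership checks, as a minimal sketch using copies of the earlier variables (the names are illustrative). Note that `in` on a dictionary tests its keys, not its values.

```python
my_list = [1, 2, 3, 100, "a random string"]
my_string = "I am a string!"
my_dict = {"name": "panda", "favorite_food": "pizza"}

in_list = 100 in my_list                      # element membership
in_string = "string" in my_string             # substring membership
in_dict_keys = "name" in my_dict              # dictionaries test keys
in_dict_values = "panda" in my_dict.values()  # test values explicitly
not_in_list = 7 not in my_list                # negated form

print(in_list, in_string, in_dict_keys, in_dict_values, not_in_list)
```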

### 3. Loops

Python has `for` loops and `while` loops, and many objects (lists, strings, dictionaries) can be iterated over directly.

In [None]:
# a simple for loop

print("For loop: ")
for i in range(5):
    print(i)
    
# the equivalent while loop
print("While loop: ")
j = 0
while j < 5:
    print(j)
    j += 1
    

In [None]:
# simple list iteration
for item in myList:
    print(item)

In [None]:
# simple dictionary iteration
for key in myDict:
    print(key + ": " + myDict[key])
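
Two built-in helpers often make these loops shorter. The sketch below, using copies of the earlier variables, shows `enumerate()` for getting the position alongside each item and `.items()` for getting keys and values together. The f-string also works when a value is not a string, which the `+` concatenation above would not.

```python
my_list = [1, 2, 3, 100, "a random string"]
my_dict = {"name": "panda", "favorite_food": "pizza"}

# enumerate() yields (position, item) pairs
indexed = [(i, item) for i, item in enumerate(my_list)]

# .items() yields (key, value) pairs
lines = [f"{key}: {value}" for key, value in my_dict.items()]

print(indexed)
print(lines)
```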

---

# Now on to pandas 🐼

Let's first import the `pandas` package and give it a shorthand `pd` so it's easier to call it. You can name it whatever you want but it's usually standardized to `pd`.

The same goes for `matplotlib.pyplot`, which is conventionally imported as `plt`!


### Getting help

To open up a documentation window about a specific function or method, you can append a question mark and run the cell, e.g. `pd.some_method?`

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
print(pd.__version__)

## DataFrame

What makes pandas so special is its use of dataframes to manipulate, clean, and understand datasets.

DataFrames are like tables while Series are like single-column tables/vectors.

In [None]:
# Try running this cell to see what the documentation says about dataframes!
pd.DataFrame?

## Creating DataFrames

There are multiple ways to read data into a dataframe. Pandas reads most common file formats beautifully.

Most common ways are:
* `pd.read_csv("some_file_name.csv")` method; you can also read in .tsv, .txt etc. by passing the right separator, e.g. `sep="\t"`
* `pd.read_json("some_json.json")` method
* `pd.read_excel("some_excel.xlsx")` method
* convert a list or a dictionary into a dataframe
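
As a rough sketch of two of these options, the cell below reads CSV text and converts a list of dictionaries. The CSV content is made up for illustration and held in memory with `io.StringIO`, so no file on disk is assumed.

```python
import io
import pandas as pd

# Hypothetical CSV content, kept in memory so no file is needed
csv_text = "Major,FacultyCount\nPhilosophy,6\nEconomics,15\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# A list of dictionaries: one dictionary per row
from_records = pd.DataFrame([
    {"Major": "Philosophy", "FacultyCount": 6},
    {"Major": "Economics", "FacultyCount": 15},
])

print(from_csv)
print(from_records)
```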

Here we will create a sample dataframe from a dictionary and look at some common methods and attributes like
* head() & tail()
* shape
* drop()
* set_index()
* dtypes
* mean(), std(), describe()

In [None]:
df = pd.DataFrame({"StudentCount2018": [None, 8, 32, 60],
                   "HeadFaculty": ["Craig", "Taneli", "Godfried", "Herves"],
                   "Building": ["C3", "A6", "A2", "A5"],
                   "FacultyCount": [5, 6, 9, 15],
                   "ClassesOffered": [9, 7, 9, 13],
                   "Major": ["Interactive Media", "Philosophy", "Computer Science", "Economics"]})

In [None]:
df

In [None]:
# summarizes the data types of all columns
df.dtypes

In [None]:
# head() and tail() display the first and last n rows; default is 5
df.head()

In [None]:
# sets the index as the Major
df.set_index("Major")

In [None]:
# drops rows or columns; to drop columns you need to specify axis=1
df.drop("ClassesOffered", axis=1)

In [None]:
# numeric_only=True skips text columns such as HeadFaculty
df.mean(numeric_only=True)

In [None]:
df.describe()
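
When a dataframe mixes text and numeric columns, it helps to restrict statistics to the numeric ones. The following is a minimal sketch on a small made-up dataframe (`sample` is an illustrative name), using `numeric_only=True` and `select_dtypes`.

```python
import pandas as pd

sample = pd.DataFrame({"Major": ["Philosophy", "Economics"],
                       "FacultyCount": [6, 15],
                       "ClassesOffered": [7, 13]})

# Average only the numeric columns
numeric_means = sample.mean(numeric_only=True)

# Keep only the numeric columns as a new dataframe
numeric_only_frame = sample.select_dtypes(include="number")

# describe() summarizes numeric columns by default
summary = sample.describe()

print(numeric_means)
print(summary)
```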

In [None]:
# sort values by specific columns
df.sort_values('StudentCount2018', ascending=False)

## 🐛🐛🐛🐛🐛 COMMON BUG 🐛🐛🐛🐛🐛

Often, we forget to save a modified dataframe back to itself or to a new variable. Always remember to do either

* `df = df.sort_values(...)`

OR

* `df.sort_values(..., inplace=True)`

This applies not only to `sort_values()` but also to `drop()`, `set_index()` etc.
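
The bug is easy to reproduce. The sketch below uses a small made-up dataframe (`df_demo` is an illustrative name) to show that calling `sort_values()` without saving the result leaves the original untouched.

```python
import pandas as pd

df_demo = pd.DataFrame({"Major": ["Philosophy", "Economics", "Computer Science"],
                        "FacultyCount": [6, 15, 9]})

# The result is discarded, so df_demo keeps its original order
df_demo.sort_values("FacultyCount")
unchanged_order = df_demo["FacultyCount"].tolist()

# Assigning the result keeps the sorted copy
df_sorted = df_demo.sort_values("FacultyCount")
sorted_order = df_sorted["FacultyCount"].tolist()

print(unchanged_order, sorted_order)
```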

---

## Selecting Data 

Indexing specific columns, rows and cells is a very important part of data manipulation, as you will see.

There are multiple ways of indexing.

### Columns


In [None]:
# square bracket notation
df['HeadFaculty']

In [None]:
# .ColName notation
df.HeadFaculty

In [None]:
# slicing with a range selects rows instead: print only the first 2 rows
df[:2]

### Selecting by label

To select rows (and optionally columns) by their index labels, we use `df.loc[]`.

In [None]:
major_df = df.set_index("Major")

In [None]:
# We can select a single cell by its row label and column label
major_df.loc["Philosophy", "HeadFaculty"]

In [None]:
major_df.loc["Interactive Media", "StudentCount2018"]
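
Beyond single cells, `.loc` also accepts a single row label, lists of labels, and label slices. The sketch below uses a small made-up dataframe (`majors` is an illustrative name). Unlike ordinary Python slices, a label slice includes its end label.

```python
import pandas as pd

majors = pd.DataFrame(
    {"HeadFaculty": ["Craig", "Taneli", "Godfried"],
     "FacultyCount": [5, 6, 9]},
    index=["Interactive Media", "Philosophy", "Computer Science"],
)

one_cell = majors.loc["Philosophy", "HeadFaculty"]   # single value
one_row = majors.loc["Philosophy"]                   # whole row as a Series
subset = majors.loc[["Philosophy", "Computer Science"], ["FacultyCount"]]
label_slice = majors.loc["Interactive Media":"Philosophy"]  # end label included

print(one_cell)
print(label_slice)
```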

### Selecting by position

To select rows and columns by their integer position (starting from 0), we use `df.iloc[]`.

In [None]:
# print the 4th column of the 2nd row (positions start at 0)
df.iloc[1, 3]

In [None]:
# print the 2nd to 4th columns of the 2nd and 3rd rows (the stop position is not included)
df.iloc[1:3,1:4]

### Selecting by boolean

Select rows only if they meet a certain condition, i.e. the condition evaluates to `True` for that row. Pretty useful for large datasets!

In [None]:
# print rows only if there are more than 10 students enrolled
df[df.StudentCount2018 > 10]
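
Conditions can also be combined. The sketch below, on a trimmed-down copy of the sample data (`faculty` is an illustrative name), uses `&` for "and" and `|` for "or". Each condition needs its own parentheses, and rows with a missing value compare as `False`.

```python
import pandas as pd

faculty = pd.DataFrame({
    "Major": ["Interactive Media", "Philosophy", "Computer Science", "Economics"],
    "StudentCount2018": [None, 8, 32, 60],
    "FacultyCount": [5, 6, 9, 15],
})

# Both conditions must hold
mask = (faculty["StudentCount2018"] > 10) & (faculty["FacultyCount"] < 10)
selected = faculty[mask]

# At least one condition must hold
either = faculty[(faculty["StudentCount2018"] > 50) | (faculty["FacultyCount"] <= 5)]

print(selected)
print(either)
```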

---
## Handling missing data

In real-life datasets, we often get missing or incorrect data values. Pandas has some useful methods for handling those cases. For example:

* `dropna()`: drops rows that contain NA values
* `fillna(value=valToReplaceWith)`: replaces NA values with some other number like 0, the average, the value of the adjacent cell, a string etc


In [None]:
# Flag which values in a specific column are NA (returns True/False per row)
pd.isna(df.StudentCount2018)

In [None]:
# Just exclude rows that contain NA vals
df.dropna()

In [None]:
df.fillna(0)

In [None]:
df.fillna("NO MAJOR STUDENT")

In [None]:
df.fillna(df.mean(numeric_only=True))

In [None]:
df.fillna(df.min())
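
Different columns usually need different replacement values. As a rough sketch, the made-up dataframe below (`enrollment` is an illustrative name) has gaps in both a numeric and a text column. It counts them and then fills each column with its own value by passing a dictionary to `fillna()`.

```python
import pandas as pd

enrollment = pd.DataFrame({"StudentCount2018": [None, 8, 32, 60],
                           "Building": ["C3", None, "A2", "A5"]})

# Count missing values in each column
missing_per_column = enrollment.isna().sum()

# Fill each column with a value that suits it
filled = enrollment.fillna({
    "StudentCount2018": enrollment["StudentCount2018"].mean(),
    "Building": "unknown",
})

print(missing_per_column)
print(filled)
```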

### Simple Plot

We can use a default .plot() method in pandas to visualize the dataframe.

In [None]:
df.plot?

In [None]:
df.plot(x="Major", y="FacultyCount",
        kind="bar", title="Number of Faculty per Major")

### Matplotlib Plot

Alternatively, we can use the matplotlib library directly, which offers far more customization.

For a simple bar chart the difference is small, but for more complicated plots matplotlib and other plotting libraries come in very handy.

In [None]:
plt.bar(df.Major, df.FacultyCount, color=['r','c','b','y'])
plt.title("Number of Faculty per Major")
plt.show()
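
For a taste of that extra customization, here is a sketch of the same chart built with matplotlib's figure-and-axes interface. It uses plain lists copied from the sample data. The figure size, axis labels, and tick rotation are arbitrary choices made for this illustration.

```python
import matplotlib.pyplot as plt

majors = ["Interactive Media", "Philosophy", "Computer Science", "Economics"]
faculty_counts = [5, 6, 9, 15]

# Create a figure and a single set of axes to draw on
fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(majors, faculty_counts, color=["r", "c", "b", "y"])

# Titles and labels are set on the axes object
ax.set_title("Number of Faculty per Major")
ax.set_xlabel("Major")
ax.set_ylabel("Faculty count")
ax.tick_params(axis="x", rotation=45)  # tilt long major names
fig.tight_layout()
```

In a notebook with `%matplotlib inline`, the figure is displayed when the cell finishes running.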

---
# Now we are ready to look at real datasets!

We can now see how everything we looked at can be used to understand, clean, and train on a real-life dataset.