# Python and Its Data Science Libraries
- Additional Resources:
    - [Python for Everyone](https://www.py4e.com/lessons)
    - [Python for Data Analysis, 2nd Edition](https://wesmckinney.com/book/)



## Python Tutorial

In [None]:
# Loops
for letter in ['A', 'B', 'C', 'D', 'E', 'F']:
    print(letter) # all statements in the for loop should be indented
print(letter) # This line will only be executed when the loop ends

In [None]:
# The range() function can produce a sequence of integers
for i in range(5, 12): # it contains integers from 5 (inclusive) to 12 (exclusive)
    print(i)
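
The other call forms of `range()` follow the same inclusive-start, exclusive-stop rule. A minimal sketch, with arbitrarily chosen arguments:

```python
# range(stop): starts at 0
print(list(range(5)))         # [0, 1, 2, 3, 4]

# range(start, stop): stop is excluded
print(list(range(5, 12)))     # [5, 6, 7, 8, 9, 10, 11]

# range(start, stop, step): step can be any non-zero integer
print(list(range(0, 10, 3)))  # [0, 3, 6, 9]
print(list(range(5, 0, -1)))  # [5, 4, 3, 2, 1]
```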

In [None]:
# Functions
def get_sum(list1):
    """
    Returns the sum of list1.
    """
    total = 0  # avoid the name "sum", which would shadow the built-in sum()
    for val in list1:
        total += val
    return total

In [None]:
# Function parameters can have a default value.
def add(x, y=30, z=50):

    return x + y + z
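
To see how the default values are filled in, the sketch below calls `add` positionally and by keyword. The function is repeated so the cell runs on its own, and the argument values are arbitrary.

```python
def add(x, y=30, z=50):
    return x + y + z

print(add(1))       # y and z use their defaults: 1 + 30 + 50 = 81
print(add(1, 2))    # only z uses its default:    1 + 2 + 50 = 53
print(add(1, z=0))  # a keyword argument can skip over y: 1 + 30 + 0 = 31
```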

In [None]:
# Classes
class Person():

    def __init__(self, name, age): # __init__ is the class constructor. There are two underscores before
                                    # and after the word "init"
        self.name = name
        self.age = age

    def print_info(self):
        print(self.name + " | " + str(self.age))

In [None]:
me = Person("Liang", 34)
me.print_info()

In [None]:
# list
py_list = ["Alex", 29, 130.5, True, False, [1, 3, 5]]
print(py_list[0])
print(py_list[5][0])
print(py_list[-1])
print(py_list[-2])
print(py_list[0:3]) # this will extract elements with index 0, 1, 2
print(py_list[:3])
print(py_list[5:])
print(py_list[:])

In [None]:
# dictionary
py_dict = {}
py_dict['Alex'] = "A"
py_dict['Bob'] = 'B'
print(py_dict)
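
Once a dictionary is populated, values are usually read back by key. The following sketch reuses the two entries from the cell above; the key `'Chris'` and the default `'N/A'` are made up for illustration.

```python
py_dict = {'Alex': 'A', 'Bob': 'B'}

print(py_dict['Alex'])               # look up a value by its key
print(py_dict.get('Chris', 'N/A'))   # .get() returns a default instead of raising KeyError
print('Bob' in py_dict)              # the "in" operator checks keys, not values

# Iterate over key/value pairs
for name, grade in py_dict.items():
    print(name, grade)
```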

## Pandas Data Frames

[Pandas](https://pandas.pydata.org/) is a major tool for data scientists working in Python. It contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy.

References:
- Python for Data Analysis, Chapter 5: Getting Started with Pandas

In [None]:
import numpy as np
import pandas as pd

Pandas provides two data structures built on top of NumPy arrays:
- **Series**: a labeled 1D array, used to represent a single feature
- **DataFrame**: a labeled 2D table, used to represent a data table

We will focus on data frames today, as most data sets are stored in table format.
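
The relationship between the two structures can be sketched as follows. The names and scores are made-up sample values; the point is only that selecting one column of a data frame gives back a series.

```python
import pandas as pd

# A Series: one labeled column of values
s = pd.Series([60, 70, 80], index=['Alice', 'Bob', 'Chris'], name='Quiz1')

# A DataFrame: several columns sharing the same row index
df = pd.DataFrame({'Quiz1': [60, 70, 80], 'Quiz2': [65, 75, 85]},
                  index=['Alice', 'Bob', 'Chris'])

print(type(s).__name__)            # Series
print(type(df).__name__)           # DataFrame
print(type(df['Quiz1']).__name__)  # Series: a single column of a DataFrame
```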

In [None]:
# Define a DataFrame from scratch
df1 = pd.DataFrame(np.random.rand(5, 3),
                   columns=['Feature1', 'Feature2', 'Feature3'])
df1.head(5) # shows the first 5 rows

In [None]:
# Print the shape of the data frame
print(df1.shape)

In [None]:
# Row indices
# print(df1.index)
print(df1.index.values)

In [None]:
# Column indices
print(df1.columns.values)

In [None]:
# Access elements using .loc[row_index, col_index]
# Ex: Print the Feature1 value on the first row

df1.loc[0, 'Feature1']

In [None]:
# Index slicing
# Ex: Print the Feature2 value for the first 3 rows
df2 = df1.loc[0:2, ['Feature2']].copy() # with .loc the end label is inclusive; .copy() makes df2 independent of df1
df2

In [None]:
# Modifying df2 does not affect df1
df2.loc[0, "Feature2"] = 12345
df2
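
The inclusive end index is specific to label-based `.loc`. Position-based `.iloc` follows the usual Python rule and excludes the end. A small sketch on a made-up 5x3 frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3),
                  columns=['Feature1', 'Feature2', 'Feature3'])

by_label = df.loc[0:2, 'Feature2']  # row labels 0, 1, 2 (end label included)
by_position = df.iloc[0:2, 1]       # row positions 0, 1 (end position excluded)

print(len(by_label))     # 3
print(len(by_position))  # 2
```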

### Basic Table Operations
- Change a value
- Add a new row
- Add a new column
- Remove a row
- Remove a column

In [None]:
# Example:
data = [[60, 70, 80],
        [66, 88, 77],
        [100, 60, 30],
        [85, 87, 83]]
scores = pd.DataFrame(data,
                      index=['Alice', 'Bob', 'Chris', 'David'],
                      columns=['Quiz1', 'Quiz2', 'Final'])
scores

In [None]:
# Change Alice's final score to 90.

scores.loc["Alice", "Final"] = 90
scores

In [None]:
# Add a new row: "Edward": [77, 88, 99]
scores.loc['Edward', :] = [77, 88, 99]
scores

In [None]:
# Add a column "ExtraCredit"

scores['ExtraCredit'] = [1, 2, 3, 4, 5] # the list length must match the number of rows
scores

In [None]:
# A data frame with non-unique row indices
df2 = pd.DataFrame(np.random.rand(5, 3),
                   index=[0, 1, 1, 3, 4])
df2
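
One consequence of non-unique row indices is that a single label can select more than one row. The sketch below uses a deterministic frame (not the random one above) to show this:

```python
import numpy as np
import pandas as pd

dup = pd.DataFrame(np.arange(15).reshape(5, 3), index=[0, 1, 1, 3, 4])

print(dup.index.is_unique)  # False

# Label 1 appears twice, so .loc returns both rows as a DataFrame
print(dup.loc[1])

# Label 3 appears once, so .loc returns that row as a Series
print(dup.loc[3])
```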

In [None]:
# Remove the record of Chris

scores.drop('Chris', inplace=True)
# scores = scores.drop('Chris') # These two statements are equivalent
scores

In [None]:
# Remove column "ExtraCredit"

scores.drop('ExtraCredit', axis=1, inplace=True)
scores

In [None]:
# Remove both David and Edward

scores = scores.drop(['David', 'Edward'])
scores
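
Without `inplace=True`, `drop()` returns a new data frame and leaves the original untouched, which is why the result is assigned back in the cell above. A minimal sketch on a made-up two-row table:

```python
import pandas as pd

scores = pd.DataFrame({'Quiz1': [60, 66], 'Final': [80, 77]},
                      index=['Alice', 'Bob'])

without_bob = scores.drop('Bob')              # drops a row; scores itself is unchanged
without_final = scores.drop(columns='Final')  # same effect as drop('Final', axis=1)

print(scores.shape)         # (2, 2)
print(without_bob.shape)    # (1, 2)
print(without_final.shape)  # (2, 1)
```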

### Table Arithmetic
- Perform an operation uniformly to all values in a column
- Arithmetic with multiple columns
- Calculate statistics
- Apply a user-defined function to all rows

In [None]:
# Double the extra credits
data = [[60, 70, 80],
        [66, 88, 77],
        [100, 60, 30],
        [85, 87, 83]]
scores = pd.DataFrame(data,
                      index=['Alice', 'Bob', 'Chris', 'David'],
                      columns=['Quiz1', 'Quiz2', 'Final'])
scores['ExtraCredits'] = [1, 2, 3, 4]
print(scores.ExtraCredits)

# Multiplying a column by a scalar applies the operation to every value
scores.ExtraCredits = scores.ExtraCredits * 2
scores

In [None]:
# Calculate grades:
#   Grade = Quiz1 * 25% + Quiz2 * 25% + Final * 50% + ExtraCredits

scores['Grade'] = scores['Quiz1'] * 0.25 + scores.Quiz2 * 0.25 + scores.Final * 0.5 \
                    + scores.ExtraCredits
scores

In [None]:
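# Calculate the min, max, mean, median, variance, and std of the final grades
print(scores['Grade'].agg(['min', 'max', 'mean', 'median', 'var', 'std']))

scores['Grade'].describe()  # summary table: count, mean, std, min, quartiles, max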

In [None]:
# Ex: Define a function that determines whether the student passes the class.
def is_passing(grade):
    return (grade >= 60)


# Apply the function to the final grade column
scores['Passed?'] = scores['Grade'].apply(is_passing)
scores


In [None]:
# We can define the function using a lambda expression
scores['Passed?'] = scores['Grade'].apply(lambda x:(x >= 60))
scores

In [None]:
# With axis=1, apply() passes each row (as a Series) to the function
def is_passing2(row):
    return (row['Grade'] >= 60)

scores.apply(is_passing2, axis=1)
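
For a simple comparison like this one, the same result can also be obtained without `apply()` by comparing the whole column at once. The sketch below uses made-up grade values, not the ones computed above, and checks that both approaches agree.

```python
import pandas as pd

# Hypothetical grades for illustration only
scores = pd.DataFrame({'Grade': [73.5, 79.0, 61.0, 92.5, 55.0]})

via_apply = scores['Grade'].apply(lambda x: x >= 60)  # element by element
vectorized = scores['Grade'] >= 60                    # whole column at once

print(vectorized.tolist())                # [True, True, True, True, False]
print((via_apply == vectorized).all())    # True
```

`apply()` remains the more flexible option when the logic cannot be written as a column-wise expression.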

## Plotting Tools

Making informative visualizations of data is one of the most important tasks in data analysis.
- Learn the distribution of data
- Explore trends and patterns in data
- Identify outliers
- Generate ideas for modeling
- Present your findings

Today, we will study how to create several of the most frequently used types of plots in Python.
- Scatter plots
- Bar plots
- Histograms
- Pie plots

Reference:
- Python for Data Analysis, Chapter 9

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

A **scatter plot** uses dots to represent values for two numerical variables. The position of each dot represents an instance of data. Scatter plots are helpful for identifying relationships between variables.

In [None]:
# A simple example of scatter plots
# Source: https://www.who.int/growthref/hfa_boys_5_19years_z.pdf?ua=1
heights_boys = pd.DataFrame({'Age': range(5, 20),
                   'Height': [110, 116, 122, 127, 133, 137, 143, 149, 156, 163, 169, 173, 175, 176, 176.5]})
heights_boys

In [None]:
# Plot Age vs. Height (by default, plt.plot connects the points with a line)
plt.plot(heights_boys['Age'], heights_boys['Height'])

In [None]:
# Add descriptions to the figure
plt.plot(heights_boys['Age'], heights_boys['Height'], 'r.') # r means red color, . means using a dot for each point
plt.title("Average Height for Boys")
plt.xlabel("Age")
plt.ylabel("Height (cm)")
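
Matplotlib also has a dedicated scatter function. The sketch below draws the same data with `Axes.scatter` through the object-oriented interface (`plt.subplots`), which is an alternative to the `plt.plot(..., 'r.')` call above rather than a replacement for it.

```python
import matplotlib.pyplot as plt
import pandas as pd

heights_boys = pd.DataFrame({'Age': range(5, 20),
                   'Height': [110, 116, 122, 127, 133, 137, 143, 149, 156, 163, 169, 173, 175, 176, 176.5]})

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(heights_boys['Age'], heights_boys['Height'], color='r', marker='.')
ax.set_title("Average Height for Boys")
ax.set_xlabel("Age")
ax.set_ylabel("Height (cm)")
```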

In [None]:
# Multiple sequences of data
heights = pd.DataFrame({'Age': range(5, 20),
                        'BoyHeight': [110, 116, 122, 127, 133, 137, 143, 149, 156, 163, 169, 173, 175, 176, 176.5],
                        'GirlHeight': [109.6, 115, 121, 126.5, 132.5, 139, 145, 151, 156, 160, 161.7, 162.5, 162.8, 163, 163.2]})
heights

In [None]:
plt.figure(figsize=(8, 5))
plt.plot(heights['Age'], heights['BoyHeight'], 'r^', label="Boys")
plt.plot(heights['Age'], heights['GirlHeight'], 'gs', label='Girls')
plt.title("Average Heights")
plt.xlabel("Age")
plt.ylabel("Height (cm)")
plt.legend()

**Bar plots** are useful for presenting labeled data.

In [None]:
df = pd.DataFrame([[67, 76],
                   [78, 87],
                   [89, 98],
                   [90, 95]],
                  index=['Alice', 'Bob', 'Clare', 'David'],
                  columns=['Midterm', 'Final'])
df

In [None]:
df['Midterm'].plot.bar(color='r', figsize=(10, 6))

In [None]:
df[['Midterm']] # Putting the column name in a list makes the result a data frame.

In [None]:
df[['Midterm', 'Final']].plot.bar()

In [None]:
df[['Midterm', 'Final']].plot.bar(stacked=True)

In [None]:
df[['Midterm', 'Final']].plot.barh(stacked=True)

**Histograms** are useful for showing the distribution of a variable.
- Each bar covers a range of values.
- The height of each bar represents the number of data points in the corresponding range.
- In Matplotlib, each range includes its left edge and excludes its right edge, so a value on a boundary is counted in the bar to its right. Only the last range includes both edges.
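
How boundary values are binned can be checked directly with `np.histogram`, which is the function Matplotlib's histogram relies on for counting. The sample values and bin edges below are made up so that some values fall exactly on an edge.

```python
import numpy as np

values = [1, 2, 2, 3]

# Two bins: [1, 2) and [2, 3]; only the last bin is closed on the right
counts, edges = np.histogram(values, bins=[1, 2, 3])

print(edges)   # [1 2 3]
print(counts)  # [1 3]
```

The two values equal to 2 land in the second bin, and the value 3 is kept in the last bin because that bin includes its right edge.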

In [None]:
# Generate 10,000 values from the standard normal distribution using np.random.randn()
# Uniform alternative: df = pd.DataFrame(np.random.rand(10000), columns=['Rand'])
df = pd.DataFrame(np.random.randn(10000), columns=['Rand'])
df

In [None]:
df['Rand'].hist(bins=50)

**Pie plots** are useful for showing the proportion of values.

In [None]:
df = pd.DataFrame([5, 10, 20, 7, 3],
                  index=['A', 'B', 'C', 'D', 'F'],
                  columns=['Students'])
df

In [None]:
df['Students'].plot.pie(autopct='%.2f', figsize=(6, 6))
plt.title("Grade Distribution")