# MATH 210 Project I: Data Analysis with `Pandas`

`Pandas` is an open source Python library for data analysis. It is designed for cleaning data, analysing and modelling it, and organizing the results into a form suitable for plotting or tabular display, all in a fast and flexible way. `Pandas` makes Python more efficient and more powerful for data analysis (see the [documentation](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)).

`Pandas` introduces two main data structures to Python: **Series** and **DataFrames**. A Series is one-dimensional and a DataFrame is two-dimensional. Together they cover most typical use cases in finance, statistics, social science and engineering, and they are designed to be integrated into a scientific computing environment.

**My goal** for this tutorial is to explore **three parts** of Pandas: the two basic pandas data structures, and the `where()` method and masking. Each part provides explanations and examples for those who want a better understanding of data analysis.


**PART 1**: Introduction to pandas data structures, with the basics of the library's main data structures: Series 

**PART 2**: Introduction to pandas data structures, with the basics of the library's main data structures: DataFrames 

**PART 3**: Introduction to indexing and selecting data, focusing on the `where()` method and masking

* `Series` (see the [documentation](http://pandas.pydata.org/pandas-docs/stable/dsintro.html))
* `DataFrame` (see the [documentation](http://pandas.pydata.org/pandas-docs/stable/dsintro.html))
* `where()` method and masking (see the [documentation](http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-where-method-and-masking))


## Contents

1. `Series`
2. `DataFrame`
3. `where()` Method and Masking
4. Exercises

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Series



A Series is a one-dimensional, array-like object in which each item has an associated index label. The general way to create a Series is `s = pd.Series(data, index=index)`.

#### The data can be:
 * ndarray
 * Python dict
 * scalar value

### From ndarray

When `data` is an ndarray, `index` must be the same length as `data`. If no index is passed, each item is assigned an integer index label from 0 to N, where N is the length of the data minus one.

In [None]:
# create an arbitrary list
s = pd.Series(['a','b','c','d','e','f','g','h'])

In [None]:
s

We can specify the index when we create the Series. The index must be the same length as the data.

In [None]:
s = pd.Series([90,80,77], index=['Allen','Tom','Cindy'])

In [None]:
s

In [None]:
s = pd.Series(np.random.randn(5))

In [None]:
s
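
The indexing rules above can be checked on a small array. The following is only a minimal sketch; the array contents and the labels `'x'`, `'y'`, `'z'` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the tutorial
values = np.array([1.5, 2.5, 3.5])

# No index passed: labels default to 0, 1, ..., len(values) - 1
default_indexed = pd.Series(values)
print(list(default_indexed.index))

# Explicit index: one label per value
labelled = pd.Series(values, index=['x', 'y', 'z'])
print(labelled['y'])

# An index whose length differs from the data is rejected with a ValueError
try:
    pd.Series(values, index=['x', 'y'])
    length_error_raised = False
except ValueError:
    length_error_raised = True
print(length_error_raised)
```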

### From dict

If the data is a dictionary and an index is passed, the values in the data corresponding to the labels in the index are pulled out. Otherwise, the index is built from the dictionary keys: recent versions of pandas keep the keys in insertion order, while older versions sorted them. `NaN` (not a number) is the marker pandas uses for missing data.

In [None]:
d = {'Vancouver' : 1000, 'Victoria' : 500, 'Burnaby': 300, 'Richmond': 200, 'White Rock': None}
cities = pd.Series(d)
cities
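
When an index is passed together with a dictionary, only the requested labels are pulled out. Here is a small sketch of that behaviour; the dictionary `city_values` and the label `'Surrey'` are assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative dictionary; 'Surrey' is deliberately not one of its keys
city_values = {'Vancouver': 1000, 'Victoria': 500, 'Burnaby': 300}

# Only the requested labels are pulled out, in the requested order;
# a label with no matching key gets NaN
selected = pd.Series(city_values, index=['Burnaby', 'Surrey', 'Vancouver'])
print(selected)
print(selected.isna())
```

Because `NaN` is a floating-point value, the resulting Series holds floats even though the dictionary values are integers.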

We can select items from the Series by their index labels.

In [None]:
cities['Burnaby']

In [None]:
cities[['Burnaby','Vancouver']]

We can also use a boolean condition to select specific items.

In [None]:
cities[cities < 600]

A comparison on a Series returns a Series of True/False values, which can then be used for selection.

In [None]:
less_than_600 = cities < 600
print(less_than_600)
print('\n')
print(cities[less_than_600])

We can also change a value in a Series.

In [None]:
print('Old value:', cities['Victoria'])
cities['Victoria'] = 880
print('New value:', cities['Victoria'])

Two Series can be added together. The result is indexed by the union of the two indexes: values with a shared index label are added, and any label that appears in only one of the Series produces NaN.

In [None]:
print(cities[['Victoria', 'Richmond', 'White Rock']])
print('\n')
print(cities[['Vancouver', 'Burnaby']])
print('\n')
print(cities[['Victoria', 'Richmond', 'White Rock']] + cities[['Vancouver', 'Burnaby']])
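
The example above has no shared labels, so every entry of the sum is NaN. The sketch below uses two made-up Series that overlap on some labels, and also shows `Series.add` with `fill_value`, which is one way to avoid the NaN values.

```python
import pandas as pd

# Illustrative Series that share the labels 'b' and 'c'
first = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
second = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Labels are aligned before adding: 'b' and 'c' are summed,
# while 'a' and 'd' appear in only one Series and become NaN
total = first + second
print(total)

# Series.add with fill_value=0 treats a missing label as 0 instead
filled = first.add(second, fill_value=0)
print(filled)
```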

### From Scalar Value

If the data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.

In [None]:
pd.Series(100., index=['Allen', 'Bob', 'Cindy', 'Tom', 'Elina'])

## 2. DataFrames

A DataFrame is a 2-dimensional data structure consisting of rows and columns, similar to a spreadsheet, a SQL table or an R data frame. You can also think of it as a set of Series objects (one per column name) that share an index. Like Series, DataFrame accepts many different kinds of inputs:

 * Dict of 1D ndarrays, lists, dicts, or Series
 * Structured or record ndarray
 * Other DataFrame

### From dict of Series or dicts

The resulting index is the union of the indexes of the various Series. Any nested dictionaries are first converted to Series. If no columns are passed, the columns are taken from the dictionary keys.

In [None]:
d = {'first exam' : pd.Series([10., 7., 8.], index=['Allen', 'Bob', 'Cindy']),
'second exam' : pd.Series([9., 9., 5., 6.], index=['Allen', 'Bob', 'Cindy','Tom'])}

In [None]:
df = pd.DataFrame(d)

In [None]:
df

In [None]:
pd.DataFrame(d, index=['Allen', 'Bob', 'Cindy'])

In [None]:
pd.DataFrame(d, index=['Allen', 'Bob', 'Cindy'], columns=['second exam', 'third exam'])
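
To make the union-of-indexes rule concrete, here is a minimal sketch with two made-up Series of different lengths. The names `scores`, `'final'` and `'quiz'` are hypothetical.

```python
import pandas as pd

# Illustrative data: 'quiz' has no entry for label 'c'
scores = {
    'final': pd.Series([3.0, 4.0, 5.0], index=['a', 'b', 'c']),
    'quiz': pd.Series([1.0, 2.0], index=['a', 'b']),
}
frame = pd.DataFrame(scores)

# The row index is the union of the Series indexes
print(list(frame.index))
# The columns come from the dictionary keys
print(list(frame.columns))
# 'c' has no 'quiz' value, so that cell is NaN
print(frame.loc['c', 'quiz'])

# Each column of the DataFrame is itself a Series
print(type(frame['final']).__name__)
```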

### From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the index will be `range(n)`, where `n` is the length of the arrays.

In [None]:
d = {'round one' : [1., 2., 3., 4.], 'round two' : [4., 3., 2., 1.]}

In [None]:
pd.DataFrame(d)

In [None]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

### From structured or record array

A structured or record array is handled in the same way as a dict of arrays.

In [None]:
# 'S10' is a 10-byte string field (the older 'a10' alias is not accepted by recent NumPy versions)
data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'S10')])

In [None]:
data[:] = [(1,2.,'Hello'), (2,3.,"World")]

In [None]:
pd.DataFrame(data)

### From a sequence of (key, value) pairs

`DataFrame.from_items` took a sequence of (key, value) pairs, where the key is the column or row label and the value holds the column or row values. It was useful for constructing a DataFrame with the columns in a particular order without passing a list of columns. `DataFrame.from_items` was deprecated in pandas 0.23 and removed in pandas 1.0, so the cells below use `DataFrame.from_dict` with a dict built from the same pairs, which gives the same result.

In [None]:
pd.DataFrame.from_dict(dict([('A', [10, 8, 9]), ('B', [4, 1, 6])]))

If you pass `orient='index'`, the keys will be the row labels. In this case, you can also pass the column names.

In [None]:
pd.DataFrame.from_dict(dict([('A', [10, 8, 9]), ('B', [4, 1, 6])]), orient='index', columns=['one', 'two', 'three'])

## 3. `where()` Method and Masking

A boolean vector can be used to select a subset of the data. To guarantee that the selection output has the same shape as the original data, you can use the `where` method of `Series` and `DataFrame`. The examples below use a sample CSV file (see the [Sample CSV Data](https://support.spatialkey.com/spatialkey-sample-csv-data/)).

In [None]:
# Log in to your Python account 
# Upload the data file, e.g. "TechCrunchcontinentalUSA.csv", from your computer to the directory that contains this notebook
df = pd.read_csv("TechCrunchcontinentalUSA.csv",sep=",",parse_dates=['fundedDate'])

In [None]:
df

In [None]:
# Selecting with a boolean vector returns a subset of the data.
# Return only the rows where raisedAmt > 6000000
# (comparing the whole DataFrame with a number fails on the text and date columns)
df[df.raisedAmt > 6000000]

In [None]:
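# where() with a DataFrame-shaped boolean condition keeps the original shape.
# Restrict to the numeric columns first, since text and dates cannot be compared with a number;
# values that are not < 4000000 are replaced with NaN
numeric = df.select_dtypes(include='number')
numeric.where(numeric < 4000000)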

In [None]:
# Keep the rows where raisedAmt < 4000000
# and replace every value in the other rows with NaN
df.where(df.raisedAmt <4000000)
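
The difference between boolean indexing and `where` is easiest to see on a tiny table. This is only a sketch on made-up data: the column name `raisedAmt` is borrowed from the CSV file, while `company` and all the values are hypothetical.

```python
import pandas as pd

# Illustrative stand-in for the CSV data
amounts = pd.DataFrame({'company': ['A', 'B', 'C'],
                        'raisedAmt': [500, 7000, 1200]})
small = amounts.raisedAmt < 2000

# Boolean indexing drops the rows that fail the condition
subset = amounts[small]
print(subset.shape)

# where() keeps every row and fills the failing rows with NaN
same_shape = amounts.where(small)
print(same_shape.shape)
print(same_shape)
```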

In [None]:
# Building on the previous step, replace the values where the condition is False with -1*raisedAmt instead of NaN
df.where(df.raisedAmt <4000000, -1*df.raisedAmt,axis=0)

In [None]:
# Keep the values of raisedAmt that are < 4000000 and replace all other values with 1
df.raisedAmt.where(df.raisedAmt < 4000000,1)

In [None]:
# The same operation with np.where: keep the values < 4000000, replace the others with 1, and display the first 29 elements
np.where(df.raisedAmt < 4000000,df.raisedAmt,1)[:29]

In [None]:
# The two previous results hold the same values, so the comparison below is True everywhere.
# In general, df1.where(m, df2) is roughly equivalent to np.where(m, df1, df2),
# e.g. df.where(df < 0, -df) matches np.where(df < 0, df, -df)
df.raisedAmt.where(df.raisedAmt < 4000000,1) == np.where(df.raisedAmt < 4000000,df.raisedAmt,1)
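
The same comparison can be reproduced without the CSV file. The sketch below uses a short made-up Series named `raised` and a threshold chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative values, not taken from the CSV file
raised = pd.Series([500, 7000, 1200, 4000])
keep = raised < 2000

# Series.where keeps values where the condition is True, else uses the replacement
via_pandas = raised.where(keep, 1)
# np.where(cond, x, y) picks x where cond is True, else y, and returns an ndarray
via_numpy = np.where(keep, raised, 1)

print(via_pandas.tolist())
print(via_numpy.tolist())
print((via_pandas == via_numpy).all())
```

One practical difference is that `Series.where` returns a Series that keeps the original index, while `np.where` returns a plain NumPy array.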

In [None]:
# Set values based on a boolean criterion:
# set every value to 0 in the rows where df2.raisedAmt < 4000000
df2 = df.copy()
df2[df2.raisedAmt < 4000000] = 0
df2

When calling `where`, the `axis` and `level` parameters control how the input is aligned.

In [None]:
df2 = df.copy()
df2.where(df2.raisedAmt<100000,df2['raisedAmt'],axis='index')
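
A small made-up frame shows what the row alignment does. In this sketch, `frame` and `replacement` are hypothetical names, and `axis=0` is the same as `axis='index'`.

```python
import pandas as pd

# Illustrative data: keep positive values, replace the rest row by row
frame = pd.DataFrame({'A': [1, -2, 3], 'B': [-4, 5, -6]})
replacement = pd.Series([10, 20, 30])  # one value per row

# axis=0 aligns `replacement` with the row index, so a failing cell in row i
# is replaced by replacement[i], whichever column it is in
result = frame.where(frame > 0, replacement, axis=0)
print(result)
```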

In [None]:
df2 = df.copy()
df2.where(df2.raisedAmt<1000000)

`mask` is the inverse boolean operation of `where`: it replaces the values where the condition is True and keeps the rest.

In [None]:
# Keep the values of raisedAmt < 1000000 by masking the values >= 1000000, which become NaN
df.raisedAmt.mask(df.raisedAmt >=1000000)
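
The relation between `mask` and `where` can be verified directly. This is a minimal sketch on a made-up Series; the values and the threshold are assumptions.

```python
import pandas as pd

# Illustrative values, not taken from the CSV file
raised = pd.Series([500, 7000, 1200])
large = raised >= 1000

# mask() hides the values where the condition is True ...
masked = raised.mask(large)
# ... which matches where() applied to the negated condition
kept = raised.where(~large)
print(masked.tolist())
print(masked.equals(kept))

# Both methods accept a replacement value to use instead of NaN
replaced = raised.mask(large, 0)
print(replaced.tolist())
```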

## 4. Exercises

**Exercise 1**. Use `np.random.randn` to create a DataFrame with 100 rows and 8 columns whose index consists of dates.

**Exercise 2**. Create two Series that share only some of their index labels, and return their sum.

**Exercise 3**. The `mask` method can take a replacement value for the masked entries. Use `mask` to replace every `raisedAmt` that is less than 4000000 with -1.