# MATH 210 Project I

## Analyzing Data with `pandas`

### by Geun Woo Park

`pandas` can be thought of as Microsoft Excel in Python. Depending on the user's skill and understanding of the library, it can provide quicker, more flexible and more expressive data analysis than Excel, which is itself one of the most widely used data analysis and manipulation tools in the world.

The **goal** of this notebook is simple: to learn the basic ideas behind `pandas`. By the end of this project, readers will be familiar with:
* How to create an object
* How to read/import data from an external source
* How to manipulate the data
* How to plot the data with `pandas` without calling other plotting packages directly

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Contents

1. Object creation and exploring basic data analyzing functions
2. Importing data from external sources such as `xlsx`, and `csv` files
3. Manipulation of the data
4. Plotting the data with `pandas`
5. Exercises

#### 1. Object creation and exploring basic data analyzing functions

`pd.Series(list_of_data)` creates a one-dimensional labeled array. If no index is given, a default integer index (0, 1, 2, ...) is created. The type of the data in the list is not limited: it can be text, numbers, or even special characters such as `*`.

For example:

In [None]:
S = pd.Series([1, 2, np.pi, 'A', 0.2, '*'])
print(S)
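
As a small sketch of what "not limited" means in practice, the block below compares the mixed list above with a numbers-only list. The index labels `'x'`, `'y'`, `'z'` and the variable names are made up for this illustration.

```python
import numpy as np
import pandas as pd

# A Series mixing numbers and text falls back to the generic "object" dtype.
mixed = pd.Series([1, 2, np.pi, 'A', 0.2, '*'])
print(mixed.dtype)          # object
print(list(mixed.index))    # [0, 1, 2, 3, 4, 5]  (the default integer index)

# A numbers-only Series gets a numeric dtype; the index can also be custom labels.
# The labels 'x', 'y', 'z' are hypothetical.
numeric = pd.Series([1.5, 2.5, 3.5], index=['x', 'y', 'z'])
print(numeric.dtype)        # float64
print(numeric['y'])         # 2.5
```

Keeping a column to one type is usually preferable, since numeric operations such as `mean()` are not meaningful on mixed data.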

To create a table of data with an index (the index labels can be numbers, text, and so on) and labeled columns, use `pd.DataFrame(data, index, columns, dtype, copy)`.

For example:

In [None]:
pd.DataFrame({'B : Type': pd.Categorical(["alcohol", "food", "food", "pop"]),
              'A : Menu': pd.Categorical(["Vodka(Bottle)", "Taco Platter", "Assorted BBQ Platter", "Coke(Bottle)"]),
              'C : Price/CAD': pd.Series([34.99, 19.00, 25.50, 7.90]),
              'D : Item sold': pd.Series([2, 4, 5, 3]),
              'E : Item left': pd.Series([3, 1, 0, 2])})

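I have created a sample table using `pd.DataFrame`. Note the order of the columns in the output. In older versions of `pandas`, the columns of a `DataFrame` built from a dictionary are sorted alphabetically by key rather than following the order written in the braces. Recent versions (`pandas` 0.23 or later, on Python 3.6 or later) keep the order in which the keys were written.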
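
If a particular column order is wanted regardless of version, the `columns` argument can state it explicitly, and `index` can supply row labels. The following is a minimal sketch using a shortened version of the menu table; the row labels `item1` and `item2` and the shortened column names are made up for the illustration.

```python
import pandas as pd

menu = pd.DataFrame(
    {'Type': ['alcohol', 'food'],
     'Menu': ['Vodka(Bottle)', 'Taco Platter'],
     'Price/CAD': [34.99, 19.00]},
    index=['item1', 'item2'],                # hypothetical row labels
    columns=['Menu', 'Type', 'Price/CAD'],   # explicit column order
)

print(list(menu.columns))                # ['Menu', 'Type', 'Price/CAD']
print(menu.loc['item2', 'Price/CAD'])    # 19.0
```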

Now, we will explore the basic data analysis functions that the `pandas` package provides.

To see only a portion of a table, use `table_name.head()` or `table_name.tail()`. By default, `head()` shows the first five rows and `tail()` shows the last five rows. You can choose the number of rows by passing it as an argument: `.head(3)` prints the top 3 rows, and `.tail(6)` prints the bottom 6 rows.

Alternatively, to print the table from the first row up to a specific row, use `table_name[:n]`, which returns the first `n` rows.

For example:

In [None]:
series = pd.Series(np.random.randn(1000))
series.head()

In [None]:
series.tail()

In [None]:
series[:50]
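
Because the random series above changes on every run, here is a small deterministic sketch of the same three operations. The series `s` and the other variable names are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Ten known values: 0, 10, 20, ..., 90
s = pd.Series(np.arange(10) * 10)

first_three = s.head(3)   # first 3 rows
last_two = s.tail(2)      # last 2 rows
first_four = s[:4]        # slice: rows 0 up to (not including) 4

print(list(first_three))  # [0, 10, 20]
print(list(last_two))     # [80, 90]
print(list(first_four))   # [0, 10, 20, 30]
```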

To compute summary statistics, call `.describe()` on the table. It returns the count (number of non-missing values), mean, standard deviation, minimum, the first, second and third quartiles (25%, 50% and 75%), and the maximum of the sample. For example:

In [None]:
series.describe()
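
The result of `.describe()` is itself a `Series`, so individual statistics can be looked up by label. A minimal sketch on a small made-up sample:

```python
import pandas as pd

sample = pd.Series([1, 2, 3, 4, 5])   # made-up values for illustration
summary = sample.describe()

print(list(summary.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(summary['mean'])   # 3.0
print(summary['50%'])    # 3.0  (the median)
```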

#### 2. Importing data from external sources such as `xlsx`, and `csv` files

It is simple to import/read CSV (comma-separated values) or xlsx files into the local notebook. For a CSV file, use the `read_csv` function; for an xlsx file, the analogous function is `read_excel`. The most important point is that the file must be reachable from the notebook: upload it to the repository and pass its path in the repository to the function.

For example, I created a folder called `CSVfiles` on my server and uploaded a sample CSV file containing insurance sales information. I can then import the CSV file from my server with:

In [None]:
insurance_df = pd.read_csv('CSVfiles/insurance_sample.csv')

In [None]:
insurance_df[:40]

In [None]:
insurance_df.tail()
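
For readers without access to the sample file, the sketch below feeds `read_csv` a small in-memory text instead of a file path. The column names `county` and `point_granularity` come from the insurance table, but the rows are made up for the illustration.

```python
import io
import pandas as pd

# Made-up rows; only the column names mirror the insurance sample.
csv_text = (
    "county,point_granularity\n"
    "CLAY COUNTY,1\n"
    "SUWANNEE COUNTY,3\n"
    "NASSAU COUNTY,1\n"
)

# read_csv accepts a file path or any file-like object.
demo_df = pd.read_csv(io.StringIO(csv_text))
print(demo_df.shape)             # (3, 2)

# nrows limits how many data rows are read, which helps with large files.
preview = pd.read_csv(io.StringIO(csv_text), nrows=1)
print(preview.shape)             # (1, 2)
```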

Sometimes an imported CSV file may display garbled characters. This usually happens because the file's encoding differs from the one `read_csv` assumes by default (UTF-8); passing the correct encoding through the `encoding` argument fixes this. A worked solution is in the [documentation](http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.1/cookbook/Chapter%201%20-%20Reading%20from%20a%20CSV.ipynb).
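
A minimal sketch of the usual fix, assuming the file was saved in Latin-1 rather than UTF-8. The data, the column names `city` and `temp`, and the choice of Latin-1 are all made up for the illustration; the encoding of a real file has to be found out separately.

```python
import io
import pandas as pd

# Simulate a file saved with Latin-1 encoding (hypothetical content).
raw_bytes = "city,temp\nMontréal,21\n".encode("latin-1")

# Tell read_csv which encoding to use instead of the UTF-8 default.
encoded_df = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
print(encoded_df.loc[0, "city"])   # Montréal
```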

#### 3. Manipulation of the data

With `pandas`, you do not modify the original data file: the table in your notebook is a copy held in memory, so changes to it do not affect the file unless you write them back. A common first step is therefore to extract a few columns from the imported data and create a new table that contains only the necessary data. This helps users analyze the data faster and more accurately.

For example, if I want to create a new table containing `county` and `point_granularity`, I do:

In [None]:
# Create a new table, county_p, containing only the necessary columns of insurance_df
county_p = insurance_df[['county', 'point_granularity']]
# Check that the command worked as planned
county_p[:50]
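
The same idea can be sketched on a small stand-in table: select columns with a list of names, filter rows with a condition, and confirm that changing the extracted table leaves the original untouched. The rows and the `premium` column are made up for the illustration.

```python
import pandas as pd

# Stand-in for insurance_df with made-up rows and a hypothetical 'premium' column.
source_df = pd.DataFrame({
    'county': ['CLAY COUNTY', 'SUWANNEE COUNTY', 'NASSAU COUNTY'],
    'point_granularity': [1, 3, 1],
    'premium': [100.0, 250.0, 80.0],
})

# Double brackets select several columns; .copy() makes the new table independent.
subset = source_df[['county', 'point_granularity']].copy()

# A boolean condition keeps only the rows where it is True.
coarse = subset[subset['point_granularity'] > 1]
print(list(coarse['county']))                   # ['SUWANNEE COUNTY']

# Modifying the extracted table does not change the original.
subset['point_granularity'] = subset['point_granularity'] * 10
print(list(source_df['point_granularity']))     # [1, 3, 1]
```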

### 4. Plotting the data with `pandas`

Now, we will plot the data with the `pandas` package. `pandas` provides plotting methods directly on `Series` and `DataFrame` objects. These methods are built on top of `matplotlib`, so we can analyze and plot the data through `pandas` alone, without calling `numpy` and `matplotlib` functions ourselves.

#### 1. Bar plots

In [None]:
# Create a random set of data with multiple columns
df1 = pd.DataFrame(np.random.rand(5,4), columns = [ 'a', 'b','c','d'])
df1.plot.bar()
df1

#### 2. Histogram

In [None]:
# Create a random set of data
df2 = pd.Series(np.random.randint(0,10, size = 100))
# and create a histogram with the data
df2.plot.hist(alpha = 0.7)
df2[:5]

#### 3. Pie plots

In [None]:
# Create a random set of data in a series form
df3 = pd.Series(4 * np.random.rand(5), index = ['a','b','c','d','e'], name = 'data')
# Create a pie plot for df3
df3.plot.pie(figsize = (7,7))

As you can see from the code above, creating diagrams from the data is simple: append the plot type after the name of the table, as in `DataFrame_name.plot.diagram_type()`. Inside the parentheses, you can pass settings for the diagram, such as the color and the size of the plot.

For more information about the plots, refer to the [pandas visualization documentation](http://pandas.pydata.org/pandas-docs/stable/visualization.html).
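
As a sketch of passing such settings, the block below draws a bar plot with explicit colors, figure size and title. The data and the variable names are made up; the `matplotlib.use("Agg")` line is only there so the sketch also runs as a plain script, and can be left out inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; omit this line inside a notebook
import pandas as pd

# Made-up data: 3 rows and 2 columns give 6 bars.
plot_df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 2, 1]})

ax = plot_df.plot.bar(
    color=['tab:blue', 'tab:orange'],   # one color per column
    figsize=(6, 4),                     # width and height in inches
    title='Bar plot with custom settings',
)

# The plotting method returns a matplotlib Axes that can be adjusted further.
ax.set_ylabel('value')
print(ax.get_title())   # Bar plot with custom settings
```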

### 5. Exercises using `pandas`

In this section, we will do some simple exercises that can sharpen your `pandas` skills.

#### Exercises 1.

**a)** Import `pandas` as `pd`

**b)** Create a data frame that looks like the image below

![image](https://sarahleejane.github.io/assets/simple_table_df.png)

**c)** Show summary statistics for the age of the five subjects above

**d)** Create a bar graph that shows the age of the five subjects from part (b)