# Project 1

## PANDAS Introduction

### By ......

In this project, I am going to introduce the general techniques that we can use from the PANDAS package in Python. So here is the first question in mind:

#### What is PANDAS?

[PANDAS](https://pypi.python.org/pypi/pandas/) is a Python package providing fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and intuitive. 

Then we may ask:

#### What problem does PANDAS solve?
Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.

Therefore, in this project, there are 2 things that I am going to do:

  * Introduce the basics of creating and editing our own data, and of importing data.
  
  * Import real stock price data and make a prediction.

We import PANDAS with the code
`import pandas as pd`

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Part1

Since PANDAS is a package for working with data, let's start with how we create and edit our own data.


So now, let me create an imaginary dataset of Microsoft's stock price over two weeks (10 trading days), showing the **Low** and **High** price for each day. We type

In [None]:
MS={'Day':[1,2,3,4,5,6,7,8,9,10],'High':[105,110,108,110,115,117,115,119,120,125],'Low':[101,108,100,105,107,108,110,110,115,123]}

Using `pd.DataFrame` we will create a dataframe for the data we have.

In [None]:
df=pd.DataFrame(MS)

In [None]:
print(df)

So `pd.DataFrame` turns the dictionary into a table: each key becomes a column, and each list supplies the values of that column.

Now, what if we want a quick look at the data? We use

In [None]:
df.head()

This shows only the first 5 rows of the DataFrame.

But what if we want only the first couple of rows?

So we use:

In [None]:
df.head(2)
# to get the first 2 rows of the dataset

In the same way, if we want the last couple of rows, we use


In [None]:
df.tail(2)
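As a small self-contained sketch of the cells above, the block below rebuilds the same imaginary Microsoft data and compares `head()` and `tail()` with and without an argument. The variable names `first_default`, `first_two`, and `last_two` are only illustrative.

```python
import pandas as pd

# Same imaginary Microsoft data as in the cells above.
MS = {'Day': list(range(1, 11)),
      'High': [105, 110, 108, 110, 115, 117, 115, 119, 120, 125],
      'Low': [101, 108, 100, 105, 107, 108, 110, 110, 115, 123]}
df = pd.DataFrame(MS)

first_default = df.head()   # no argument: the first 5 rows
first_two = df.head(2)      # the first 2 rows
last_two = df.tail(2)       # the last 2 rows

print(len(first_default), len(first_two), len(last_two))
print(last_two)
```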

If we want the data of just one column, we can simply type `df.<name of column>`

In [None]:
df.High

In [None]:
df.Low

To turn a column into a plain Python list

In [None]:
df.High.tolist()
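A short sketch of the two ways to select a column, using a made-up three-row table. The `'Adj Close'` values here are invented for illustration; the point is that attribute access such as `df.High` only works when the column name is a valid Python identifier, while bracket access works for any name.

```python
import pandas as pd

# Small made-up table for illustration.
df = pd.DataFrame({'Day': [1, 2, 3],
                   'High': [105, 110, 108],
                   'Low': [101, 108, 100]})

high_attr = df.High          # attribute-style access
high_bracket = df['High']    # bracket-style access, same column
same = high_attr.equals(high_bracket)

# A column name containing a space can only be reached with brackets.
df['Adj Close'] = [104, 109, 105]
adj = df['Adj Close'].tolist()

print(same)
print(adj)
```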

If we just want a specific row or several rows of the data, selected by position, we use the code `df.iloc[]`.

In [None]:
df.iloc[1]

In [None]:
df.iloc[[0,5,9]]

Then, when we want the value at a specific row and column, we use the code `df.at`.

In [None]:
df.at[2,"High"] 
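The selections above can be put side by side in one sketch. It assumes the same imaginary data with the default 0-based index; the names `one_row`, `some_rows`, `first_three`, and `value` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Day': list(range(1, 11)),
                   'High': [105, 110, 108, 110, 115, 117, 115, 119, 120, 125],
                   'Low': [101, 108, 100, 105, 107, 108, 110, 110, 115, 123]})

one_row = df.iloc[1]             # a single row (returned as a Series)
some_rows = df.iloc[[0, 5, 9]]   # several rows chosen by position
first_three = df.iloc[0:3]       # a slice; the end position is excluded
value = df.at[2, 'High']         # one cell: row label 2, column 'High'

print(some_rows)
print(value)
```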

Now, let's plot our dataset and see its general look.

We use the command `df.plot()`.

In [None]:
df.plot()

We notice there is a "0, 1, 2, ..., 9" on the horizontal axis; that is our index. If we don't specify an index, everything is treated as a normal column and an index will be generated automatically for us.

Now, you may say that an index starting from 0 is annoying to visualize, or not straightforward. Since we are talking about the daily stock market, we want the days to be our index, so we can use the code `df.set_index('Day')` to change it:

In [None]:
a=df.set_index('Day')

Then let's see what it looks like again.

In [None]:
a.plot()

Now we can see that the graph is much nicer once we change the index.

In [None]:
print(a)
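One detail worth checking is that `set_index` returns a new DataFrame and leaves the original untouched, which is why the result is stored in `a`. A minimal sketch with a made-up three-row table:

```python
import pandas as pd

df = pd.DataFrame({'Day': [1, 2, 3],
                   'High': [105, 110, 108],
                   'Low': [101, 108, 100]})

a = df.set_index('Day')   # returns a new DataFrame; df itself is unchanged

print(df.index.tolist())  # still the automatic 0-based index
print(a.index.tolist())   # now the values of the 'Day' column
print('Day' in df.columns, 'Day' in a.columns)
```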

After introducing the most basic uses of PANDAS, let's try something slightly more advanced.

Let's say we want to see only the information of **High** on some days; then we will need the command `df.loc[[],[]]`. (Older tutorials use `df.ix`, which was removed in pandas 1.0.)

In the first bracket, you choose which rows you want; in the second bracket, you choose which columns you want.

In [None]:
df.loc[[0,5,9],['Day','High']]
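The difference between label-based and position-based selection becomes visible once the index is no longer 0, 1, 2, and so on. The sketch below assumes the same imaginary data indexed by `Day`; `.loc` selects by index label and column name, `.iloc` by integer position (`.ix`, which mixed the two, was removed in pandas 1.0).

```python
import pandas as pd

df = pd.DataFrame({'Day': list(range(1, 11)),
                   'High': [105, 110, 108, 110, 115, 117, 115, 119, 120, 125],
                   'Low': [101, 108, 100, 105, 107, 108, 110, 110, 115, 123]})
a = df.set_index('Day')

# By label: days 1, 6 and 10, column named 'High'.
by_label = a.loc[[1, 6, 10], ['High']]

# By position: rows 0, 5 and 9, first column (which is 'High' here).
by_position = a.iloc[[0, 5, 9], [0]]

print(by_label)
print(by_label.equals(by_position))
```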

Now, what if we want a new column that shows the average of the **High** and **Low** prices on each day? To create it, we give the new column a name and compute it from the High and Low prices, as below. (With only two values per day, the average and the median are the same number, hence the column name.)

In [None]:
df['Median']=(df['High']+df['Low'])/2

In [None]:
df
# Now we can see what the new dataset looks like.
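The same pattern works for any column-wise arithmetic. As a sketch on a made-up three-row table, the block below adds a midpoint column and a daily range column; the column names `Midpoint` and `Range` are illustrative and not part of the original dataset.

```python
import pandas as pd

df = pd.DataFrame({'Day': [1, 2, 3],
                   'High': [105, 110, 108],
                   'Low': [101, 108, 100]})

# Arithmetic between columns is applied row by row.
df['Midpoint'] = (df['High'] + df['Low']) / 2
df['Range'] = df['High'] - df['Low']

# The midpoint can also be computed as a row-wise mean of the two columns.
row_mean = df[['High', 'Low']].mean(axis=1)

print(df)
```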

Now suppose we have two datasets in hand: the stock prices for the first 10 days, and then those for the next 5 days. To combine and analyse them, we do the following.

In [None]:
MS1={'Day':[11,12,13,14,15],'High':[105,110,108,110,115],'Low':[101,108,100,105,107]}
# This is the data for the next 5 days.
# As before, we turn it into a DataFrame.

In [None]:
df1=pd.DataFrame(MS1)

To combine the two DataFrames `df` and `df1`, we use the command `pd.concat`.

In [None]:
pd.concat([df,df1])
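By default `pd.concat` keeps each piece's original index, so the row labels restart from 0 in the combined table. A minimal sketch with made-up data, showing the optional `ignore_index=True` argument that renumbers the rows:

```python
import pandas as pd

df = pd.DataFrame({'Day': [1, 2, 3], 'High': [105, 110, 108], 'Low': [101, 108, 100]})
df1 = pd.DataFrame({'Day': [4, 5], 'High': [110, 115], 'Low': [105, 107]})

stacked = pd.concat([df, df1])                        # keeps the original row labels
renumbered = pd.concat([df, df1], ignore_index=True)  # builds a fresh 0-based index

print(stacked.index.tolist())
print(renumbered.index.tolist())
```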

We might wonder what happens if the next dataset has different columns. For example, we may have only the Open price for the following 5 days. What happens then?

(By the way, if you find an index starting from 0 annoying, you can always apply the step we mentioned earlier, `df.set_index('Day')`.)

In [None]:
MS2={'Day':[16,17,18,19,20],'Open':[105,110,108,110,115]}

In [None]:
df2=pd.DataFrame(MS2)

In [None]:
pd.concat([df,df1,df2])

As we see, there are a lot of NaN values in the table. That is how PANDAS handles datasets with different columns: it lines up the columns that match and fills every missing entry with NaN, so that we still get one whole DataFrame.

The next step is to export this new DataFrame to a new .csv file, using the code `.to_csv`. (Of course, there are plenty of other file types that we can export to.)
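A small sketch of this behaviour with made-up data: the two tables share only the `Day` column, so every other column is filled with NaN where a table has no values. The `join='inner'` option, shown for comparison, keeps only the columns that all tables share.

```python
import pandas as pd

df = pd.DataFrame({'Day': [1, 2], 'High': [105, 110], 'Low': [101, 108]})
df2 = pd.DataFrame({'Day': [3, 4], 'Open': [105, 110]})

# Default (outer) behaviour: keep every column, fill the gaps with NaN.
combined = pd.concat([df, df2], ignore_index=True)
missing = combined.isna().sum()   # number of NaN values per column

# Inner join: keep only the columns present in both tables.
shared = pd.concat([df, df2], join='inner', ignore_index=True)

print(combined)
print(missing)
```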

In [None]:
comb3=pd.concat([df,df1,df2])

In [None]:
comb3.to_csv('allstuffs.csv')

After running the command above, we will see a new file in our project folder; if we want to do something more with this dataset later, we can start from there.
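A sketch of a full round trip, writing a made-up table to a temporary folder and reading it back with `pd.read_csv`. The file names are illustrative. By default `to_csv` also writes the index as an extra unnamed column; passing `index=False` leaves it out.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Day': [1, 2, 3], 'High': [105, 110, 108], 'Low': [101, 108, 100]})

folder = tempfile.mkdtemp()   # temporary folder so the sketch leaves no files behind
path_plain = os.path.join(folder, 'example.csv')
path_with_index = os.path.join(folder, 'example_with_index.csv')

df.to_csv(path_plain, index=False)   # data columns only
df.to_csv(path_with_index)           # default: the index is written too

back = pd.read_csv(path_plain)
back_with_index = pd.read_csv(path_with_index)

print(back.columns.tolist())
print(back_with_index.columns.tolist())
```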

## Part2

Now we have a sense of how to create and handle our own dataset. The next thing to look at is how we manipulate imported data.

The first thing we want to know is how to import data. Looking at [IO tools](http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table), there are plenty of file types that PANDAS can read. As an example, I will import a dataset that shows the daily stock price of Apple for a whole year (253 trading days in total).

First of all, to import this dataset, we have to upload the file (a .csv file) to Jupyter and make sure it is in the same folder as our project notebook. Then, by using the code `pd.read_csv`, we can load the dataset into this notebook. (Of course, many other file types can be imported.)

In [None]:
appledaily=pd.read_csv('appledaily.csv')

For stock prices, we are usually most interested in the most recent values, so we use what we learned above:

In [None]:
appledaily.tail(10)

Since stock prices are numerical data, we may want to know summary values such as the mean of the stock's **Open** price, or the maximum and minimum price reached. There is a very useful command, `df.describe()`, that shows all of these.

In [None]:
appledaily.describe()
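Since `appledaily.csv` is not included here, the sketch below uses a tiny made-up price table to show what `describe()` returns and how to pull single numbers out of the summary. The values are invented for illustration.

```python
import pandas as pd

# Made-up prices, only to illustrate the shape of the summary table.
prices = pd.DataFrame({'Open': [10.0, 12.0, 11.0, 13.0],
                       'Close': [11.0, 11.5, 12.0, 14.0]})

summary = prices.describe()

# The summary is itself a DataFrame: statistics are rows, price columns are columns.
mean_open = summary.loc['mean', 'Open']
max_close = summary.loc['max', 'Close']

print(summary)
print(mean_open, max_close)
```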

From the table above, we can easily see the mean of each price column, the count of each, the max value, etc. It gives basically everything we need for introductory statistics.

As I mentioned at the very beginning, PANDAS enables you **"to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R."** So we get all of the above without leaving Python for [R](http://www.rcommander.com/) (software for data analysis); a single `.describe()` call is enough.

Now we want a general look at this stock price; we can do the following using what we learned in class about `matplotlib.pyplot`.

In [None]:
plt.figure(figsize=(15,5))
plt.plot(appledaily['Adj Close'])
plt.title("Apple stock price")
plt.xlabel("Day")
plt.ylabel("Price")
plt.show()

As we can see, there seems to be a trend in the Apple price. Since we would all like some insight into the future price of Apple stock, we need a tool that can fit a line to our dataset: the **"LinearRegression"** class from scikit-learn.

In [None]:
from sklearn.linear_model import LinearRegression as Linreg

This class works by finding the line of best fit through the data, which summarizes how the price changes over time.
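To make "line of best fit" concrete, here is a sketch that uses NumPy instead of scikit-learn (a swapped-in tool, chosen only for illustration) on made-up points. It compares `np.polyfit` with the closed-form least-squares slope and intercept for a single predictor.

```python
import numpy as np

# Made-up points that roughly follow a straight line.
x = np.arange(5, dtype=float)
y = np.array([1.0, 3.1, 4.9, 7.2, 9.0])

# Degree-1 polynomial fit: returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Closed-form least-squares solution for a single predictor.
x_centered = x - x.mean()
slope_manual = (x_centered * (y - y.mean())).sum() / (x_centered ** 2).sum()
intercept_manual = y.mean() - slope_manual * x.mean()

print(slope, intercept)
print(slope_manual, intercept_manual)
```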

The next two lines set up our x and y values: `x` is the row index reshaped into a single column, and `y` is the **Adj Close** price.

In [None]:
x=appledaily.index.values.reshape(-1,1)
y=appledaily['Adj Close'].values

Next we create the model, fit it to our data, and compute the fitted values.

In [None]:
# Create the linear regression model.
reg=Linreg()
# Fit the model to our x and y values.
reg.fit(x,y)
# Store the fitted (predicted) values for every day in y_preds.
y_preds=reg.predict(x)

Then we want a direct comparison between our prediction line and the actual stock price trend, so we use what we learned in class about plt:

In [None]:
# Set the figure size.
plt.figure(figsize=(15,5))
# Give the plot a title.
plt.title("Apple Linear Regression")
# Scatter plot of our predictions (the fitted line).
plt.scatter(x=x,y=y_preds)
# Scatter plot of the real data, in red.
plt.scatter(x=x,y=y,c="r")

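Since we have 253 days of data in total (index 0 to 252), if we want to know what the "Adj Close" price of Apple might be just past the end of the data, or even 100 days after that, we can simply use the code `reg.predict()`. It expects a 2D array with one row per sample, so we wrap the day number in double brackets.

In [None]:
reg.predict([[254]])

In [None]:
reg.predict([[354]])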
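As a self-contained sketch of the same workflow, the block below fits `LinearRegression` to made-up data that lies exactly on the line y = 2x + 1, so the result is easy to check by hand. The data and variable names are assumptions for illustration; the point is that both `fit` and `predict` take a 2D array of inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on y = 2x + 1.
x = np.arange(10).reshape(-1, 1)   # 2D: one row per sample, one column per feature
y = 2.0 * x.ravel() + 1.0

reg = LinearRegression().fit(x, y)

# predict also expects a 2D input, even for a single value.
next_value = reg.predict([[10]])
several = reg.predict(np.array([[10], [20]]))

print(reg.coef_[0], reg.intercept_)
print(next_value, several)
```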

## Suggestion for future learning

There are plenty of online resources for learning PANDAS:

1. If you want to master it by reading through the documentation, you can go to [pandas.pydata.org](http://pandas.pydata.org/); you will find everything you need there.
2. Or you can watch some YouTube [videos](https://www.youtube.com/results?search_query=PANDAS+python) and learn from others.