## LOAD PACKAGES
- import os
- import pandas as pd
- import numpy as np
- import seaborn as sns

## LOAD DATA
###  os commands to obtain file path
- path = os.path.join(os.getcwd(),path1,path2,...)

### read_csv commands
- df = pd.read_csv(path)
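
A minimal sketch of the two steps above. The folder name `data`, the file name `example.csv` and its contents are hypothetical; they are created in a temporary directory only so that the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Hypothetical folder layout, created here only to make the example self-contained.
base_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(base_dir, "data"), exist_ok=True)

# os.path.join inserts the correct separator for the current operating system.
path = os.path.join(base_dir, "data", "example.csv")

# Write a tiny CSV so there is something to read back.
with open(path, "w") as f:
    f.write("name,score\nann,1\nbob,2\n")

df = pd.read_csv(path)
print(df.shape)
```

In real use, `os.getcwd()` would take the place of `base_dir`, as in the command above.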

## CHECK DATA INFO

### check first few entries
- df.head()

### check length
- len(df)

### check columns
- df.columns

### column attributes' datatype
- df.info()

### overview of data columns
- df.describe() # summary statistics of the numeric columns, returned as a DataFrame

### check data types
- type(data)

### check instructions
- append '?' to the end of an instruction (e.g. df.head?) to display its documentation in IPython/Jupyter

## CREATING/EDITING NEW SERIES/DATAFRAMES

### Series
- `series = pd.Series([data1,data2,...], index=[index1,index2,...])`
- if no index is given, the Series is automatically assigned a numerical index (0, 1, 2, ...)
- entries can be accessed either by index label or by integer position

### Dataframes
- `df = pd.DataFrame(columns=[column1,column2,...])`   #create a dataframe with columns
- `df.loc[index label/numerics] = [data1,data2,...]`   #add a row from a list; values follow the column order
- `df.loc[index label/numerics] = dict(column1=data1,column2=data2...)` #add a row from a dict; values are matched by column name, so order does not matter
- `df.loc[row,column] = data` #change the value at a given row and column
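
The commands above can be sketched as follows. All labels and values (`"a"`, `"name"`, `"score"`, ...) are made up for illustration.

```python
import pandas as pd

# A Series with explicit index labels.
series = pd.Series([10, 20, 30], index=["a", "b", "c"])
by_label = series["b"]          # access by index label
by_position = series.iloc[1]    # access by integer position

# Without an index argument, pandas assigns 0, 1, 2, ...
default_index = pd.Series([10, 20, 30])

# Start from an empty DataFrame and add rows through .loc.
df = pd.DataFrame(columns=["name", "score"])
df.loc[0] = ["ann", 1]                  # list: values follow the column order
df.loc[1] = dict(score=2, name="bob")   # dict: values are matched by column name
df.loc[1, "score"] = 5                  # overwrite a single cell
```

Adding rows one at a time is convenient for small tables; for larger data it is usually faster to build the DataFrame from a list or dict in one call.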

...

### Changing column names
> `df.rename(columns={name1:name2,...})`
- this renames existing columns (name1 becomes name2); it returns a new DataFrame and leaves df unchanged unless the result is assigned back

### Dropping a line
- df.drop(index label/numerics, inplace={False/True})
- many instructions return a modified copy of the original df, so either assign the result back to the original variable or use the inplace flag
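
A short sketch of the copy vs. in-place behaviour for `rename` and `drop`. The column names and values are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# rename returns a new DataFrame; df itself still has columns "a" and "b".
renamed = df.rename(columns={"a": "alpha"})

# Option 1: keep the result in a variable (or assign it back to df).
dropped = df.drop(0)        # the row with index label 0 is missing in the copy

# Option 2: modify the existing object; the call itself then returns None.
df.drop(1, inplace=True)
```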

### Transpose
- df.T

## FILTERING & SELECTING

### filter dataframes: certain columns
> df[attribute]  
- this returns a Series type object

>df[[attribute1,attribute2,attribute3...]]
- this returns DataFrame type object
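- when combining boolean conditions (see row filtering below), use '&' to represent 'and' and '|' to represent 'or', with each condition wrapped in parentheses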

- index/name can represent attributes


### filter dataframes: certain rows
> df.loc[labels]

> df.iloc[integer position]
- can use slicing

> df[df.some_attributes == something] 
- the conditional expression produces a boolean mask; only the rows where the mask is True are kept

> more masking operations
- series.isin(values)
- series.str.contains(pattern), or any other string method under .str
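
The selection and masking commands above, sketched on a small made-up table (the names, ages and cities are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["ann", "bob", "cara", "dan"],
        "age": [25, 32, 41, 19],
        "city": ["Oslo", "Rome", "Oslo", "Lima"],
    },
    index=["w", "x", "y", "z"],
)

one_col = df["age"]              # single name -> Series
two_cols = df[["name", "age"]]   # list of names -> DataFrame

by_label = df.loc["x"]           # row selected by index label
by_position = df.iloc[0:2]       # first two rows by position (end excluded)

# Boolean masks; each condition is wrapped in parentheses before combining.
adults_in_oslo = df[(df["age"] > 20) & (df["city"] == "Oslo")]
in_cities = df[df["city"].isin(["Rome", "Lima"])]
name_has_a = df[df["name"].str.contains("a")]
```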

### sorting
> df.sort_values(by = 'someattribute', ascending = False/True)


## CLEANING DATA

### Missing/Null of Data Value

#### convert non-numericals to numerics 


> pd.to_numeric(series, errors={'ignore','raise','coerce'}, downcast={None, 'integer','signed','unsigned','float'})
- 'errors': when a value cannot be parsed as a number, either return the input unchanged ('ignore'), raise an exception ('raise'), or replace the value with NaN ('coerce')
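
A minimal sketch of the 'coerce' and 'raise' behaviours; the input strings are made up.

```python
import pandas as pd

raw = pd.Series(["1", "2.5", "abc"])

# 'coerce': values that cannot be parsed become NaN.
numeric = pd.to_numeric(raw, errors="coerce")

# 'raise': the first unparseable value triggers an exception.
try:
    pd.to_numeric(raw, errors="raise")
    raised = False
except ValueError:
    raised = True
```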
  

#### nan value
>type(np.nan) 
- this outputs 'float' (the `np.NaN` alias was removed in NumPy 2.0; use `np.nan`)

> df[columns].isna() 
- this is a mask, which returns a Series/DataFrame of boolean values
- df[columns].isna().sum() counts the NaN values in each column, which shows whether the columns have any

#### dropping nan value

> df.dropna(axis = {0,1}, how={'any','all'}, thresh=N, subset=[col1,col3,...], inplace={True, False}) 
- this drops rows or columns that contain NaN values
- 'axis': axis = 0 drops rows; axis = 1 drops columns
- 'how': 'any' drops the row if any of its values is NaN; 'all' drops the row only if all of its values are NaN
- 'thresh': a row needs at least N non-NaN values to be kept
- 'subset': restricts the NaN check to specific columns (or to specific rows when axis = 1)
- 'inplace': False returns a modified copy; True alters the df in place
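
The effect of each `dropna` parameter can be checked on a small made-up table. Each call below returns a copy, so `df` stays unchanged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan],
        "b": [np.nan, np.nan, 6.0, 8.0],
        "c": [9.0, np.nan, 11.0, 12.0],
    }
)

nan_counts = df.isna().sum()          # NaN count per column

any_dropped = df.dropna(how="any")    # keeps only rows without any NaN
all_dropped = df.dropna(how="all")    # drops only rows that are entirely NaN
at_least_two = df.dropna(thresh=2)    # keeps rows with at least 2 non-NaN values
subset_a = df.dropna(subset=["a"])    # only column "a" is checked for NaN
```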

#### filling nan value

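> df.fillna(value={scalar,dict},method={'bfill','ffill'},axis={0,1},limit=N)
- if you don't want to drop data entries, you can replace the NaN values using the fillna method
- the simplest form is df.fillna(0), which replaces every NaN with 0
- set value={'col1':value1, 'col2':value2...} to specify which value replaces NaN in each column
- 'method': 'ffill' carries the last valid value forward, 'bfill' uses the next valid value (df.ffill() / df.bfill() do the same as dedicated methods)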
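
A short sketch of the three ways of filling described above, on made-up data. Forward filling is shown with `df.ffill()`, which performs the same operation as `method='ffill'`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, np.nan]})

zeros = df.fillna(0)                             # one scalar for every NaN
per_column = df.fillna(value={"a": -1, "b": 99}) # a different value per column
forward = df.ffill()                             # propagate the last valid value downwards
```

Note that forward filling leaves a leading NaN in place, because there is no earlier value to copy.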

### Duplicates/Uniqueness of Data Value
#### check uniqueness
>df.nunique()
- this shows the number of unique values in each column

>pd.unique(series)
- this takes a Series (one column) as input and returns an array of its unique values

#### check duplication
>df.duplicated(subset=[col1,col2,...],keep={'first','last',False})
- this is another mask
- 'subset': the columns to compare when looking for duplicates; all columns are used by default
- 'keep': specifies which occurrence of a duplicate is marked False (i.e. not a duplicate): the 'first' or the 'last'; False marks all occurrences as duplicates (True)
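
The uniqueness and duplication checks above, sketched on a made-up table with a key column `k` and a value column `v`:

```python
import pandas as pd

df = pd.DataFrame({"k": ["x", "y", "x", "z", "x"], "v": [1, 2, 1, 3, 4]})

unique_counts = df.nunique()        # number of distinct values per column
unique_keys = pd.unique(df["k"])    # distinct values in order of appearance

# Full-row duplicates: row 2 repeats row 0, so only row 2 is marked True.
first = df.duplicated(keep="first")

# Duplicates judged on column "k" only.
by_key_last = df.duplicated(subset=["k"], keep="last")   # last "x" is kept
all_marked = df.duplicated(subset=["k"], keep=False)     # every "x" is marked

# Invert the mask to keep only the non-duplicated rows.
deduped = df[~first]
```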

## COMBINING DATA

1. Concatenation

2. Merging

3. Joining

### Concatenation
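> pd.concat([df1, df2, df3...], axis={0,1}, join={'inner', 'outer'}, ignore_index={False,True})
- 'join' applies to the axis other than the one being concatenated along: if axis = 0 (stacking rows), the columns are inner/outer joined, and vice versa
- 'ignore_index': True discards the original index and renumbers the result 0, 1, 2, ...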
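
A minimal sketch of how `axis` and `join` interact, using two made-up frames that share only column `b`:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"b": [5, 6], "c": [7, 8]})

# Stack rows; outer join keeps the union of columns (missing cells become NaN).
rows_outer = pd.concat([df1, df2], axis=0, join="outer", ignore_index=True)

# Stack rows; inner join keeps only the columns present in both frames.
rows_inner = pd.concat([df1, df2], axis=0, join="inner", ignore_index=True)

# Place the frames side by side, aligned on the row index.
side_by_side = pd.concat([df1, df2], axis=1)
```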
            
### Merging
> pd.merge(leftdf, rightdf, on=[some_attributes], how={'inner','outer','left','right'},sort={True,False})
- merge, unlike concat, takes exactly 2 df inputs
- 'on': the column(s) whose values are compared to match rows; they must exist in both dfs
- 'how': inner join is the default; 'left' means left outer join, which keeps all rows of leftdf, adds the matching data from rightdf, and drops rightdf rows that have no match in leftdf
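
The join types can be compared on two made-up frames that share the column `key`:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# Default (inner): only keys found in both frames survive.
inner = pd.merge(left, right, on="key")

# Left: every row of `left` is kept; unmatched rows get NaN in right_val.
left_join = pd.merge(left, right, on="key", how="left")

# Outer: the union of keys from both frames.
outer = pd.merge(left, right, on="key", how="outer")
```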

## DRAWING/PLOTTING DATA


### Import
> import matplotlib.pyplot as plt


### Structure
> - object hierarchical (tree-like) structure
- the root is "Figure" and a figure can have multiple "Axes" at the level lower than figure.
![title](https://matplotlib.org/1.5.1/_images/fig_map.png)
From the [matplotlib documentation](https://matplotlib.org/1.5.1/_images/fig_map.png).

> - each element of the plot is its own manipulable Python object
![title](https://matplotlib.org/3.2.1/_images/sphx_glr_anatomy_001.png)
From the [matplotlib documentation](https://matplotlib.org/3.2.1/_images/sphx_glr_anatomy_001.png).


### Axes plots
> `plt.plot([1,2,3,4])` 
- this plots points (0,1),(1,2),(2,3),(3,4), where x is assumed to be 0,1,2...

> `plt.plot([1,2,3,4],[1,4,9,16])`
- this plots (1,1),(2,4),(3,9),(4,16)

> `plt.plot(x, y1, 'r--', x, y2, 'bs', x, y3, 'g^')`
- this adds a format string to each line: 'r--' means red dashed line, 'bs' means blue squares, 'g^' means green triangles, 'bo' means blue circles, 'k' means black

> `plt.xlabel('LATEX NOTATION')`
`plt.ylabel('LATEX NOTATION')`
- label the x and y axes

> `plt.legend(['LATEX NOTATION', 'LATEX NOTATION', 'LATEX NOTATION'],loc={'upper left','upper right','lower left','lower right'})`
- the labels (which may use LaTeX notation) are assigned to the plotted lines in order

> `plt.xscale('log')`
- X-axis can be log scaled to contain larger range of x

> `plt.show()`
- display the graph

> `np.arange(start,stop,step)`
- create an np array of evenly spaced values from start up to, but not including, stop

> - plot a function by using np.arange to define x and a def (Python function) to define y
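
A minimal sketch of plotting a function this way. The function name `square` is hypothetical, and the `Agg` backend is selected only so that the example runs without a display; in a notebook that line can be dropped.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, assumed here for headless runs
import matplotlib.pyplot as plt
import numpy as np


def square(values):
    """Hypothetical function to plot: y = x^2."""
    return values ** 2


x = np.arange(0, 10, 1)   # 0, 1, ..., 9
y = square(x)

plt.plot(x, y, "r--", x, x, "bs")   # red dashed curve and blue squares
plt.xlabel("$x$")
plt.ylabel("$f(x)$")
plt.legend(["$x^2$", "$x$"], loc="upper left")
```

Calling `plt.show()` afterwards displays the figure in an interactive session.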


### Stateful vs. Stateless Interfaces
- Stateful: state machine interface
- Stateless: Object-oriented interface
- Almost every function in pyplot (plt) either automatically refers to the current axes in the current figure, or creates one if none exists. So plt.plot() is actually a wrapper that calls ax = plt.gca() (get current axes) and then ax.plot() underneath.
- To modify the underlying object directly, use the OO approach by calling methods of an axes object.
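
The same kind of plot written in both styles, as a sketch. The data points are taken from the examples above; the `Agg` backend is an assumption so that the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, assumed here for headless runs
import matplotlib.pyplot as plt

# Stateful: pyplot tracks the current figure and axes behind the scenes.
plt.figure()
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
current_ax = plt.gca()   # the axes that plt.plot() just drew on

# Stateless (object-oriented): keep explicit references and call their methods.
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 4, 9, 16])
ax.set_xlabel("x")
ax.set_xscale("log")
```

The OO style is easier to follow once a figure has more than one axes, because each call names the object it changes.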


### Subfigures & Axes
> `fig, ax = plt.subplots(nrows, ncols)`
- To access different axes, either use index addressing: ax[row][col].plot(ANY_POINTS) (ax is a 2D array of axes when nrows and ncols are both greater than 1)
- or use unpacking: ax1,...,ax10 = ax.flatten()

> `fig, ((ax1, ax2, ax3, ax4, ax5),(ax6, ax7, ax8, ax9, ax10)) = plt.subplots(2, 5)`
- in this way, ax1,ax2...,ax10 can be directly addressed
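
A short sketch of the addressing options for a 2 x 5 grid. The plotted points are arbitrary, and the `Agg` backend is an assumption so that the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, assumed here for headless runs
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 5)   # ax is a 2 x 5 NumPy array of axes

# Index addressing: row 0, column 1.
ax[0][1].plot([1, 2, 3])

# Unpacking after flattening: the order is row by row.
ax1, ax2, *remaining = ax.flatten()
ax2.set_title("second panel")
```

Here `ax2` and `ax[0][1]` refer to the same axes object, so the title lands on the panel that already holds the line.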


### Figure size
#### Get figure information
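> `fig.get_size_inches()`
- this returns the figure size (width, height) in inches

>`fig.get_dpi()`
- this returns the resolution in dots per inch, i.e. the conversion between pixels and inches: a value of 72 means 72 pixels/dots correspond to 1 inch

>`fig.savefig('FIG_NAME')`
- this saves the figure; a bare file name is saved in the current directory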

#### Set figure
> `plt.subplots(nrows,ncols,figsize=(width,height))`
- take figsize as one of the parameters in `plt.subplots()`
> the default size of labels is 10pt. Setting an oversized figsize makes the labels look small once the figure is scaled down to fit. For the final visualization, the labels must be legible, at least 8 pt.
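
A sketch that sets the figure size, reads it back, and saves the result. The size `(8, 3)` and the file name `example.png` are arbitrary choices; the file is written to a temporary directory, and the `Agg` backend is an assumption so that the code runs without a display.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, assumed here for headless runs
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(8, 3))   # width 8 in, height 3 in

width_in, height_in = fig.get_size_inches()
dpi = fig.get_dpi()

# Size in pixels follows from inches times dots per inch.
width_px = width_in * dpi
height_px = height_in * dpi

# Save to a temporary folder; a bare file name would land in the current directory.
out_path = os.path.join(tempfile.mkdtemp(), "example.png")
fig.savefig(out_path)
```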

>`fig.suptitle('LATEX_NAME',fontsize=n)`
- this sets the title of the whole figure, above all subplots

>`plt.setp(axes, xlim=[1900, 2017], ylim=[0, 10], ylabel = 'LATEX STRING')`
- plt.setp() sets properties on objects; the 'axes' argument can be either a single axes or a list/array of axes


### Various Plots
> > 1. stackplot (can show the relationship for each and the total)<br /><br />
`plt.stackplot(x_value, y1,y2,y3,...)` <br /> 
`plt.legend([y1_label,y2_label,y3_label,...])`

> > 2. bar charts (categorical data representation)<br /><br />
`plt.bar(x_axis_arr, y_axis_arr, color=[c1,c2,..])` - bar vertical<br />
`plt.barh(y_axis_arr,x_axis_arr, color=[c1,c2,..])` - bar horizontal<br />
`ax.set_xticklabels(x_axis_arr, rotation = n, ha={right,left})` - rotate the x labels, where ha stands for horizontal alignment; this is a method of an axes object (e.g. ax = plt.gca()), not of plt

> > 3. scatterplot (high dimensionality - x,y,size,color)<br /><br />
`plt.scatter(x_axis_arr, y_axis_arr, s=size_arr, c=color, edgecolor=color)`


> `np.arange(0,10,1)**2`
- this returns `array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])`