# Pandas
* Pandas is one of the most popular Python libraries for data analysis.
* To use it, we just import the library.
* By convention, we import it under the alias pd.

In [2]:
import pandas as pd

* There are two core objects in pandas
    1. DataFrame
    2. Series.


# 1. DataFrame
* A DataFrame is a table.
* It consists of multiple columns and multiple rows.
* Each row is called a **record**.
* Each record contains many **entries**.
* Entries can take any data type
  * int, string, float, bool, etc.
* To create a new DataFrame we use
  * the pd.DataFrame() constructor.
  * We pass a dictionary to the constructor to initialize the data.
  * The keys are the column names,
  * while the values are lists holding the entries of each column.
  * This is the standard way of constructing a new DataFrame.
  * By default, the constructor labels the rows with integers from 0 to n - 1.
  * Sometimes this works well, but when we need to name each row, we use the index parameter.
  * The number of index labels must equal the number of rows.
  * Labels may have different data types and may repeat, but as best practice they should be unique, of a single data type, and descriptive.
  

In [7]:
pd.DataFrame({'Yes' : [50,21], 'No' : [131, 2]}, index = ['Women', 1])


Unnamed: 0,Yes,No
Women,50,131
1,21,2
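
The index rules above can be checked with a small sketch. It reuses the counts from the cell above; the label 'Men' is a hypothetical replacement for the mixed-type label 1, chosen only to illustrate a consistent index.

```python
import pandas as pd

# Same counts as in the cell above.
data = {'Yes': [50, 21], 'No': [131, 2]}

# Without an index argument the rows are labelled 0 .. n - 1.
default_index = pd.DataFrame(data)

# With an index argument every row gets a name ('Men' is a made-up label).
named_index = pd.DataFrame(data, index=['Women', 'Men'])

print(default_index.index.tolist())
print(named_index.loc['Women', 'Yes'])

# An index whose length differs from the number of rows is rejected.
mismatch_rejected = False
try:
    pd.DataFrame(data, index=['Women'])
except ValueError as err:
    mismatch_rejected = True
    print('ValueError:', err)
```

Labels of one type also make row lookups with .loc easier to read than a mix of strings and numbers.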


# 2. Series
* A Series is a sequence of data values.
* You can think of it as a single column of a DataFrame.
* In its simplest form it is just a list of items.
* As before, we can label the rows using the index parameter.
* It may hold elements of different data types (its dtype then becomes object).
* A Series can have a name, which becomes the column name when it is placed in a DataFrame.
* So if we combine several Series, we get a DataFrame.
* Put the other way around, a DataFrame is just a bunch of Series combined together.


In [13]:
pd.Series([1,4,10], index= ['ahmed', 'mohammed','mostafa'], name = 'Goals of each player')

ahmed        1
mohammed     4
mostafa     10
Name: Goals of each player, dtype: int64
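
To make the link between Series and DataFrame concrete, here is a minimal sketch that combines two Series into one table with pd.concat. The assists values and the short names goals and assists are invented for illustration.

```python
import pandas as pd

players = ['ahmed', 'mohammed', 'mostafa']

# Two Series that share the same index.
goals = pd.Series([1, 4, 10], index=players, name='goals')
assists = pd.Series([3, 0, 2], index=players, name='assists')  # made-up values

# Each Series becomes one column; the Series names become the column names.
stats = pd.concat([goals, assists], axis=1)
print(stats)

# Selecting a single column gives back a Series.
one_column = stats['goals']
print(type(one_column).__name__, one_column.name)
```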

# Reading data files.
* Usually we do not create DataFrames from scratch.
* Most of the time, we work with data that already exists.
* When we load it, we store it in a DataFrame object.
* There are many data file types, but the most common ones are:
    1. csv -> comma-separated values
    2. json -> JavaScript Object Notation
    3. excel -> Excel files
    4. sql -> SQL databases
* Usually we will deal with CSV files.
* When you open a CSV file, you will see something like this:
  * ProductA, ProductB, ProductC,
  * 10, 20, 30,
  * 40, 50, 60,
  * 70, 80, 90,
* The first row is the header, which contains the column names.
* The rest of the rows are the data.

In [14]:
first_dummy_data = pd.read_csv('trial_csv_file.csv')
first_dummy_data.head()

Unnamed: 0,ProductA,ProductB,ProductC,Unnamed: 3
0,10,20,30,
1,40,50,60,
2,70,80,90,
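
The Unnamed: 3 column in the output above comes from the trailing comma at the end of every line: pandas reads it as a fourth, empty column. The sketch below reproduces this without a file on disk by feeding the same text through io.StringIO, and assumes the file has no spaces after the commas. Dropping all-empty columns with dropna is one possible cleanup, not the only one.

```python
import io

import pandas as pd

# Same content as the sample file, including the trailing commas.
csv_text = 'ProductA,ProductB,ProductC,\n10,20,30,\n40,50,60,\n70,80,90,\n'

df = pd.read_csv(io.StringIO(csv_text))
columns = df.columns.tolist()
print(columns)

# Remove columns in which every value is missing.
cleaned = df.dropna(axis='columns', how='all')
print(cleaned.columns.tolist())
```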


# Basic functions
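* df.head() -> returns the first 5 rows of the dataframe (pass a number to change the count, e.g. df.head(10))
* df.tail() -> returns the last 5 rows of the dataframe
* df.shape -> an attribute (no parentheses) holding the tuple (number of rows, number of columns)
* df.info() -> prints the data type of each column and the number of non-null values in each column
* df.describe() -> returns a statistical summary of each numerical column
* df.column_name -> returns the column named column_name as a Series; this only works when the name is a valid Python identifier and does not clash with a dataframe attribute, while df['column_name'] always works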
* df[['column_name1', 'column_name2']] -> returns a dataframe with the two columns
* df.column_name.mean() -> returns the mean of the column
* df.column_name.std() -> returns the standard deviation of the column
* df.column_name.max() -> returns the maximum value of the column
* df.column_name.min() -> returns the minimum value of the column
* df.column_name.count() -> returns the number of non-null values in the column
* df.column_name.unique() -> returns the unique values in the column
* df.column_name.value_counts() -> returns the number of each unique value in the column
* df.column_name.idxmax() -> returns the index of the maximum value in the column
* df.column_name.idxmin() -> returns the index of the minimum value in the column
* df.column_name.sort_values() -> returns the column sorted in ascending order
* df.column_name.sort_values(ascending = False) -> returns the column sorted in descending order
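* df.column_name.sort_values(ascending = False, inplace = True) -> tries to sort the Series itself and returns None; it does not reorder the dataframe, and on a column taken from a dataframe pandas may raise an error or sort only a copy, depending on the version. To sort the whole dataframe by a column, use df.sort_values('column_name', ascending = False)
* df.column_name.sort_index() -> returns the column sorted by the index
* df.column_name.sort_index(ascending = False) -> returns the column sorted by the index in descending order
* df.column_name.sort_index(ascending = False, inplace = True) -> same limitation as above: it does not reorder the dataframe. To reorder the dataframe itself, use df.sort_index(ascending = False, inplace = True)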
* df.column_name.plot(kind = 'hist') -> draws a histogram of the column (plotting requires matplotlib; the call returns a matplotlib Axes object)
* df.column_name.plot(kind = 'box') -> draws a box plot of the column
* df.column_name.plot(kind = 'bar') -> draws a bar plot of the column
* df.column_name.plot(kind = 'pie') -> draws a pie plot of the column
* df.column_name.plot(kind = 'line') -> draws a line plot of the column
* df.column_name.plot(kind = 'area') -> draws an area plot of the column
* df.plot(kind = 'scatter', x = 'column_name1', y = 'column_name2') -> draws a scatter plot; it needs two columns, so it is called on the dataframe, not on a single column
* df.column_name.plot(kind = 'density') -> draws a density plot of the column (requires SciPy)
* df.column_name.plot(kind = 'kde') -> same as 'density': a kernel density estimate of the column
* df.plot(kind = 'hexbin', x = 'column_name1', y = 'column_name2') -> draws a hexbin plot; like scatter, it needs two columns and is called on the dataframe
* df.column_name.plot(kind = 'hist', bins = 10) -> draws a histogram of the column with 10 bins
* df.column_name.plot(kind = 'hist', bins = [1, 2, 3, 4, 5]) -> draws a histogram using the given values as bin edges
* df.column_name.plot(kind = 'hist', bins = 10, range = (0, 100)) -> 10 bins covering the range from 0 to 100
* df.column_name.plot(kind = 'hist', bins = 10, range = (0, 100), cumulative = True) -> as above, with each bin showing the cumulative count
* df.column_name.plot(kind = 'hist', bins = 10, range = (0, 100), cumulative = True, density = True) -> as above, with normalized values (density was called normed in old matplotlib versions; normed no longer works)
* df.column_name.plot(kind = 'hist', bins = 10, range = (0, 100), cumulative = True, density = True, orientation = 'horizontal') -> as above, drawn horizontally
* df.column_name.plot(kind = 'hist', bins = 10, range = (0, 100), cumulative = True, density = True, orientation = 'horizontal', color = 'red') -> as above, in red
* df.column_name.plot(kind = 'hist', bins = 10, range = (0, 100), cumulative = True, density = True, orientation = 'horizontal', color = 'red', alpha = 0.5) -> as above, with 50% transparency (alpha = 0.5)
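
A few of the non-plotting functions from this list can be tried on a tiny table. This is only a sketch: the column names product and price and all values are invented for the example.

```python
import pandas as pd

# Hypothetical data, used only to exercise the functions listed above.
df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'A'],
    'price': [10.0, 40.0, 70.0, 20.0],
})

n_rows, n_cols = df.shape               # shape is an attribute, not a method
mean_price = df['price'].mean()         # average of the column
counts = df['product'].value_counts()   # how often each product appears
top_row = df['price'].idxmax()          # index label of the highest price

# Sorting returns a new dataframe; df itself keeps its original order.
sorted_df = df.sort_values('price', ascending=False)

print(n_rows, n_cols, mean_price, top_row)
print(counts)
print(sorted_df)
```

Sorting through df.sort_values keeps whole rows together, which is usually what is wanted when ordering a table by one of its columns.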

# pd.read_csv() basic parameters
* pd.read_csv('file_name.csv') -> returns a dataframe read from the csv file; the first row is used as the header
* pd.read_csv('file_name.csv', header = None) -> treats the file as having no header row, so the first row is read as data and the columns are numbered 0, 1, 2, ...
* pd.read_csv('file_name.csv', header = None, names = ['column_name1', 'column_name2']) -> as above, but the columns get the specified names
* pd.read_csv('file_name.csv', header = None, names = ['column_name1', 'column_name2'], index_col = 0) -> as above, and the first column is used as the index
* Because the first column becomes the index, it is removed from the regular columns of the dataframe and used as the row labels.
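
The effect of these parameters is easiest to see side by side. The sketch below reads a small headerless CSV from a string instead of a file; the content and the column names label, first, and second are made up for the example.

```python
import io

import pandas as pd

# A small CSV without a header row (hypothetical content).
raw = 'x,10,20\ny,40,50\n'

# header=None: the first line is data, columns are numbered 0, 1, 2.
no_header = pd.read_csv(io.StringIO(raw), header=None)
print(no_header.columns.tolist())

# names=...: supply the column names ourselves.
column_names = ['label', 'first', 'second']
named = pd.read_csv(io.StringIO(raw), header=None, names=column_names)
print(named.columns.tolist())

# index_col=0: the first column becomes the row labels.
indexed = pd.read_csv(io.StringIO(raw), header=None, names=column_names, index_col=0)
print(indexed.index.tolist(), indexed.columns.tolist())
```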

In [17]:
data_frame_with_idx_Equal_zero = pd.read_csv('trial_csv_file.csv', index_col=0)
data_frame_with_idx_Equal_zero

Unnamed: 0_level_0,ProductB,ProductC,Unnamed: 3
indicies,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,20,30,
1,50,60,
2,80,90,


# Saving The Dataframe to a file on your disk
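* To do so we use the DataFrame method **to_csv** and pass it the file path, including the file name and the .csv extension.
* If the file already exists, it is overwritten.
* To append to an existing file, we use the parameter mode = 'a'.
* When appending, also pass header = False, otherwise the column names are written again in the middle of the file.
* The mode values are those of Python's built-in open(); for saving, the relevant ones are 'w', 'a' and 'x':
  * mode = 'w' -> write mode (the default): create the file, or overwrite it if it exists
  * mode = 'r' -> read mode (read-only, so it cannot be used for saving)
  * mode = 'x' -> create mode: fail if the file already exists
  * mode = 'a' -> append mode: add to the end of the file
  * mode = 'r+' -> read and write mode
  * mode = 'w+' -> write and read mode
  * mode = 'a+' -> append and read mode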



In [24]:
# save the data frame to the disk
data_frame_with_idx_Equal_zero.to_csv('new_csv_file.csv',mode='a')
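
Appending is easier to follow on a throwaway file. The sketch below writes one dataframe, appends a second one, and reads the result back. The file name products.csv and the values are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

# Two small, made-up dataframes with the same columns.
first = pd.DataFrame({'ProductA': [10, 40], 'ProductB': [20, 50]})
second = pd.DataFrame({'ProductA': [70], 'ProductB': [80]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'products.csv')

    # Default mode 'w': create the file (or overwrite an existing one).
    first.to_csv(path, index=False)

    # mode='a' appends; header=False avoids writing the column names twice.
    second.to_csv(path, mode='a', header=False, index=False)

    combined = pd.read_csv(path)

print(combined)
```

Passing index = False leaves the row labels out of the file, so reading it back does not produce an extra unnamed column.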