# Introduction to Data Analysis Using Pandas

## What is Pandas?
Pandas is an open-source Python package that provides tools for high-performance data analysis and data manipulation.

#### Pandas Data Structures
Pandas supports two main data structures:

1. Pandas Series
2. Pandas DataFrame

#### Pandas Series
A Pandas Series is a one-dimensional labeled array capable of holding any data type. It is built on top of NumPy array objects.

![pd_series](data/pd_series.png)

A Pandas Series can be given explicit index labels. If none are provided, pandas uses default indexing (a `RangeIndex` from 0 to n-1).

![pd_series2](data/pd_series2.png)
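
The two indexing behaviours described above can be sketched as follows. The variable names and values are illustrative and are not taken from the figures.

```python
import pandas as pd

# Series with explicit index labels (illustrative values)
prices = pd.Series(data=[245, 25, 55], index=['bike', 'pants', 'watch'])

# Series without labels: pandas falls back to a RangeIndex from 0 to n-1
plain = pd.Series([245, 25, 55])

print(prices.index)               # the labels passed in
print(plain.index)                # the default RangeIndex
print(prices['pants'], plain[1])  # both look up the second element
print(type(prices.to_numpy()))    # the values are exposed as a NumPy array
```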

#### Pandas DataFrame
A Pandas DataFrame is a two-dimensional labeled data structure consisting of rows and columns.

Each column in Pandas DataFrame is a Pandas Series.
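
A quick way to check this is to select one column and inspect its type. This is a minimal sketch using a small hypothetical DataFrame named `df_demo`.

```python
import pandas as pd

# small illustrative DataFrame (hypothetical data)
df_demo = pd.DataFrame({'Integers': [1, 2, 3], 'Floats': [4.5, 8.2, 9.6]})

# selecting a single column returns a Series
col = df_demo['Floats']

print(type(col))                        # a pandas Series
print(col.name)                         # the Series is named after the column
print(col.index.equals(df_demo.index))  # it shares the DataFrame's row index
```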

##### How to Create Pandas DataFrames?
We can create a Pandas DataFrame from dictionaries, JSON objects, CSV files, and other sources.
![pd_frame](data/data_frame.png)
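
As a rough sketch of two of these sources, the example below builds the same small table from JSON text and from CSV text. The sample data and variable names are hypothetical, and `StringIO` stands in for a real file on disk.

```python
import json
from io import StringIO

import pandas as pd

# JSON text holding a list of records (hypothetical sample data)
json_text = '[{"item": "bike", "price": 245}, {"item": "pants", "price": 25}]'
from_json = pd.DataFrame(json.loads(json_text))

# CSV text; StringIO lets read_csv treat the string like an open file
csv_text = "item,price\nbike,245\npants,25\n"
from_csv = pd.read_csv(StringIO(csv_text))

print(from_json)
print(from_csv)
```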



In [None]:
import pandas as pd

## Creating DataFrames

In [None]:
# Create a DataFrame manually from a dictionary of Pandas Series

# create a dictionary of Pandas Series 
items = {'Bob' : pd.Series(data = [245, 25, 55], index = ['bike', 'pants', 'watch']),
         'Alice' : pd.Series(data = [40, 110, 500, 45], index = ['book', 'glasses', 'bike', 'pants'])}

# print the type of items to see that it is a dictionary
print(type(items)) 

# create a Pandas DataFrame by passing it a dictionary of Series
shopping_carts = pd.DataFrame(items)
shopping_carts

In [None]:
print('shopping_carts has shape:', shopping_carts.shape)
print('shopping_carts has a total of:', shopping_carts.size, 'elements')
print()
print('The data in shopping_carts is:\n', shopping_carts.values)
print()
print('The row index in shopping_carts is:', shopping_carts.index)
print()
print('The column index in shopping_carts is:', shopping_carts.columns)

In [None]:
# create a DataFrame that only has a subset of the data/columns
bob_shopping_cart = pd.DataFrame(items, columns=['Bob'])

bob_shopping_cart

In [None]:
# create a DataFrame that only has selected rows (index labels)
sel_shopping_cart = pd.DataFrame(items, index = ['pants', 'book'])
sel_shopping_cart

In [None]:
# combine both of the above - selected rows for selected columns
alice_sel_shopping_cart = pd.DataFrame(items, index = ['glasses', 'bike'], columns = ['Alice'])
alice_sel_shopping_cart

In [None]:
# create DataFrames from a dictionary of lists (arrays)
# In this case, however, all the lists (arrays) in the dictionary must be of the same length

# create a dictionary of lists (arrays)
data = {'Integers' : [1,2,3],
        'Floats' : [4.5, 8.2, 9.6]}

# create a DataFrame 
df = pd.DataFrame(data)

df

In [None]:
# create a DataFrame and provide the row index
df = pd.DataFrame(data, index = ['label 1', 'label 2', 'label 3'])

df

In [None]:
# create DataFrames from a list of Python dictionaries
# create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35}, 
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]

# create a DataFrame 
store_items = pd.DataFrame(items2)
store_items

In [None]:
# create a DataFrame and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2'])
store_items.head()

## Loading Data into a DataFrame

In [None]:
# load a CSV file into a DataFrame
filename = 'data/data.csv'

df = pd.read_csv(filename)
df

*Note:* Pandas provides reading methods for many other file formats; see the IO tools page of the [Pandas documentation website](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).
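
For illustration, here is a minimal sketch of two of those other readers: `read_csv` with a non-default separator and `read_json`. The sample text is hypothetical, and `StringIO` is used so the example runs without any files.

```python
from io import StringIO

import pandas as pd

# tab-separated text: read_csv handles it through the sep parameter
tsv_text = "item\tprice\nbike\t245\npants\t25\n"
from_tsv = pd.read_csv(StringIO(tsv_text), sep="\t")

# JSON text: read_json accepts a file-like object
json_text = '[{"item": "bike", "price": 245}, {"item": "pants", "price": 25}]'
from_json = pd.read_json(StringIO(json_text))

print(from_tsv)
print(from_json)
```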

In [None]:
# limit which rows are read when reading in a file
pd.read_csv(filename, nrows=10)        
# only read first 10 rows

In [None]:
pd.read_csv(filename, skiprows=[1, 2]) 
# skip the first two rows of data

In [None]:
# randomly sample a DataFrame
train = df.sample(frac=0.75) 
# will contain 75% of the rows
train

In [None]:
test = df[~df.index.isin(train.index)] 
# will contain the other 25%
test
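
Because `sample` draws rows at random, the split above changes on every run. The sketch below shows one way to make it reproducible by passing `random_state` (an addition not used in the cells above) and checks that the two parts do not overlap. `df_demo` is a hypothetical stand-in for the DataFrame loaded from the CSV file.

```python
import pandas as pd

# hypothetical stand-in for the DataFrame loaded from the CSV file
df_demo = pd.DataFrame({'value': range(8)})

# random_state fixes the seed so the same rows are sampled on every run
train_demo = df_demo.sample(frac=0.75, random_state=0)

# rows that were not sampled form the test set
test_demo = df_demo[~df_demo.index.isin(train_demo.index)]

print(len(train_demo), len(test_demo))  # 6 2
print(set(train_demo.index).isdisjoint(test_demo.index))  # True
```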

In [None]:
# change the maximum number of rows and columns printed ('None' means unlimited)
pd.set_option('display.max_rows', None)
# default is 60 rows

pd.set_option('display.max_columns', None) 
# default is 20 columns
print(df)

In [None]:
# reset options to defaults (full option names avoid ambiguous matches)
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')

# change the options temporarily (settings are restored when you exit the 'with' block)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)

## Dealing with NaN values (missing data)

In [None]:
# Dealing with NaN values (missing data)

# create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
          {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10}]

# We create a DataFrame and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])

# check if we have any NaN values in the DataFrame loaded from the CSV file
# .any() performs an OR operation: for each column it returns True
# if any value in that column is NaN
print(df.isnull().any())
# example output for a DataFrame with these columns:
# Date   False
# Open   True
# High   False
# Low    False
# Close  False
# Volume False
# dtype: bool

# count the number of NaN values in the DataFrame:
# the first sum() counts NaN values per column, the second adds up those counts
x = store_items.isnull().sum().sum()
print("Number of NaN values is:", x)

In [None]:
# count the number of non-NaN values in each column of the DataFrame
x = store_items.count()
print("Number of non-NaN values in each column:\n", x)

In [None]:
original_store = store_items.copy()
# dropna() returns a new DataFrame and leaves the original unmodified by default;
# to remove missing values from the original df, use inplace=True
# axis=0 drops every row that contains at least one NaN
store_items.dropna(axis=0, inplace=True)

print(store_items)

In [None]:
print(original_store)
# replace all NaN values with 0 (returns a new DataFrame; original_store is unchanged)
original_store.fillna(0)

In [None]:
# forward filling: replace each NaN value with the previous value in the df.
# When replacing NaN values with forward filling, we can use previous
# values taken from columns (axis=0) or rows (axis=1).
# replace NaN values with the previous value in the column
# (ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions)
original_store.ffill(axis=0)

In [None]:
# backward filling: replace the NaN values with the values that
# go after them in the DataFrame
# replace NaN values with the next value in the row
# (bfill() replaces fillna(method='backfill'), which is deprecated in recent pandas versions)
original_store.bfill(axis=1)
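
To make the direction of each fill concrete, here is a small sketch on a hypothetical DataFrame named `demo`. A NaN with no earlier value (forward fill) or no later value (backward fill) stays NaN.

```python
import numpy as np
import pandas as pd

# hypothetical data with gaps in both columns
demo = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                     'b': [np.nan, 5.0, np.nan]},
                    index=['store 1', 'store 2', 'store 3'])

# forward fill down each column: a NaN takes the value from the row above
filled_down = demo.ffill(axis=0)

# backward fill along each row: a NaN takes the value from the column to its right
filled_back = demo.bfill(axis=1)

print(filled_down)  # 'b' for store 1 stays NaN: nothing precedes it
print(filled_back)  # 'b' for store 1 and store 3 stays NaN: nothing follows it
```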

## head, tail, describe, max, memory_usage

In [None]:
data = pd.read_csv(filename)
data.head()

In [None]:
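data.head()      # first 5 rows
data.tail()      # last 5 rows
data.describe()  # summary statistics for the numeric columns
# max value in each column
data.max()

# display the memory usage of a DataFrame

# summary of the DataFrame, including total memory usage
data.info()

# usage by column, in bytes
data.memory_usage()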

### Exporting DataFrames


In [None]:
df.to_csv('new_data.csv')
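
By default, `to_csv` also writes the row index as the first column, which shows up as an extra `Unnamed: 0` column when the file is read back. The sketch below passes `index=False` to leave it out. It writes to an in-memory `StringIO` buffer rather than a file, and `demo` is a hypothetical DataFrame.

```python
from io import StringIO

import pandas as pd

# hypothetical DataFrame to export
demo = pd.DataFrame({'Integers': [1, 2, 3], 'Floats': [4.5, 8.2, 9.6]})

# in-memory stand-in for a file such as 'new_data.csv'
buffer = StringIO()
demo.to_csv(buffer, index=False)  # index=False leaves the row index out
csv_text = buffer.getvalue()
print(csv_text)

# reading the text back gives the same table
restored = pd.read_csv(StringIO(csv_text))
print(restored.equals(demo))
```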