### Introduction

Pandas is a Python library for fast, convenient analysis of tabular data. According to the Pandas documentation, it provides:

```fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive```

Pandas is considered fast because most of its operations are carried out either by NumPy functions or by code written in Cython and compiled to C.

Like NumPy, Pandas is designed for vectorized operations that operate on entire columns or datasets in one sweep. Thinking about each “cell” or row individually should generally be a last resort, not a first.
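
As a minimal sketch of what "vectorized" means here, the example below computes the same result twice: once as a single column-wise expression and once row by row. The frame `prices` and its column names are hypothetical and not part of the course data.

```python
import pandas as pd

# Hypothetical example data, not taken from the course files
prices = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})

# Vectorized: one expression multiplies the two columns element-wise
total_vectorized = prices["price"] * prices["quantity"]

# Row-by-row equivalent: same result, but more verbose and typically slower
total_loop = pd.Series(
    [row.price * row.quantity for row in prices.itertuples()],
    index=prices.index,
)
```

Both Series hold the same values; on large frames the vectorized form is usually the one to reach for first.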

What benefits does Pandas provide over NumPy for data analysis?

- Pandas has a data structure named `DataFrame` that can hold columns of different data types, unlike a NumPy array, which requires homogeneous data.

- Data structures in Pandas have column names and row indexes, which make it simple to keep track of variables and observations. With NumPy, it is sometimes cumbersome to keep track of the variables represented by columns.

- Pandas provides many built-in functions for data ingestion, manipulation, cleaning, and visualization on top of those offered by NumPy.
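
The first point in the list can be illustrated with a small sketch. The records below are hypothetical; the point is only that each DataFrame column keeps its own type, while a single NumPy array coerces everything to one common type.

```python
import numpy as np
import pandas as pd

# Hypothetical records mixing text and integers
mixed = pd.DataFrame(
    {"city": ["Youngstown", "Yankton"], "lat": [41, 42], "state": ["OH", "SD"]}
)

# Each column has its own dtype; "lat" stays an integer column
lat_is_integer = pd.api.types.is_integer_dtype(mixed["lat"])

# The same values in one NumPy array are coerced to a single string dtype,
# so the numbers 41 and 42 become the strings "41" and "42"
as_array = np.array([["Youngstown", 41], ["Yankton", 42]])
```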

In practice, NumPy and Pandas are often used side by side; for tabular data, data scientists generally prefer Pandas because of the features and convenience it offers.


## Pandas data structure

`DataFrame` and `Series` are the two main data structures in Pandas.

### Series

A Series is a one-dimensional, mutable, labeled array whose elements all share the same data type.

A straightforward way of creating a Series is to call the `pd.Series()` constructor and pass it the data as a list, a dictionary, or a 1-d NumPy array, optionally together with an index of row labels.

```python
pandas_series = pd.Series(data, index=index)
```

In [98]:
pd.Series(range(3))

0    0
1    1
2    2
dtype: int64

In [99]:
pd.Series(range(3), index=['first', 'second', 'third'])

first     0
second    1
third     2
dtype: int64
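
The cells above build a Series from a `range`. A short sketch of the other inputs mentioned, a dictionary and a 1-d NumPy array, might look like this; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index labels
series_from_dict = pd.Series({"first": 0, "second": 1, "third": 2})

# From a 1-d NumPy array, with explicit index labels
series_from_array = pd.Series(np.arange(3), index=["first", "second", "third"])

# Values are looked up by label and can be changed in place
series_from_dict["second"] = 10
```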

### Dataframe

A DataFrame is a two-dimensional, labeled data structure whose columns can have different data types. Conceptually, it can be thought of as a collection of Series that share a common index. As with a Series, a DataFrame can be created from various inputs such as dictionaries, lists, NumPy arrays, and Series. Alternatively, a DataFrame can be created by loading data from external sources such as flat files and databases.

#### From List


The optional `index` (row labels) and `columns` (column labels) arguments can be passed along with the data to the `pd.DataFrame()` constructor.

In [100]:
pd.DataFrame([[1, 2, 3], [4, 5, 6]])

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |


In [101]:
pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['first', 'second'], columns=['a', 'b', 'c'])

|   | a | b | c |
|---|---|---|---|
| first | 1 | 2 | 3 |
| second | 4 | 5 | 6 |
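
Besides a list of lists, a dictionary is a common input. The sketch below builds the same small table from a dictionary of lists, and then a second frame from a dictionary of Series to show that rows are aligned by index label rather than by order. All names are illustrative.

```python
import pandas as pd

# From a dictionary of lists: the keys become the column labels
df_from_dict = pd.DataFrame(
    {"a": [1, 4], "b": [2, 5], "c": [3, 6]},
    index=["first", "second"],
)

# From a dictionary of Series: values are aligned on the Series index,
# so the order of labels in each Series does not matter
df_from_series = pd.DataFrame({
    "a": pd.Series([1, 4], index=["first", "second"]),
    "b": pd.Series([5, 2], index=["second", "first"]),
})
```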


#### From a flat file

Pandas provides functions for reading text formats such as CSV and JSON, as well as binary formats such as SAS, Stata, and Excel files.

In [102]:
CSV_FILE = "/home/asimbanskota/t81_577_data_science/weekly_materials/week6/files/city.csv"

df = pd.read_csv(CSV_FILE)

The `head` method returns the first n rows of the DataFrame.

In [103]:
df.head(2)

|   | id | lat | lon | city | state |
|---|---|---|---|---|---|
| 0 | 0 | 41 | 80 | "Youngstown" | OH |
| 1 | 1 | 42 | 97 | "Yankton" | SD |


Names for the index (axis=0) and the columns (axis=1) can be supplied at import time, or changed after import with `rename`.

In [104]:
df.rename({'city': 'CITY'}, axis=1).head(2)

|   | id | lat | lon | CITY | state |
|---|---|---|---|---|---|
| 0 | 0 | 41 | 80 | "Youngstown" | OH |
| 1 | 1 | 42 | 97 | "Yankton" | SD |
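
Supplying names at import time could look like the sketch below. It assumes a file without a header row; the in-memory text `raw` is a hypothetical stand-in for such a file, so the example runs without the course data.

```python
import io
import pandas as pd

# Hypothetical stand-in for a small CSV file that has no header row
raw = io.StringIO("0,41,80,Youngstown,OH\n1,42,97,Yankton,SD\n")

# header=None says there is no header line; names= supplies the column labels
cities = pd.read_csv(raw, header=None, names=["id", "lat", "lon", "city", "state"])

# Use the "id" column as the row index after import
cities = cities.set_index("id")
```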


Sometimes a dataset is so large that it does not fit into memory, or processing it all at once is very slow. In such cases, the data can be read and preprocessed iteratively in chunks (segments of the data), and the reduced chunks can later be concatenated into a DataFrame of manageable size for further processing.

In [65]:
beer_review_file = "/home/asimbanskota/t81_577_data_science/weekly_materials/week6/data/beer_reviews.csv"
df_chunks = pd.read_csv(beer_review_file, chunksize=100000)
chunk_list = []
for chunk in df_chunks:
    # Placeholder for real preprocessing: keep only the first 5 rows of each chunk
    chunk_processed = chunk.head(5)
    chunk_list.append(chunk_processed)

# Combine the processed chunks into a single DataFrame
df_concat = pd.concat(chunk_list)
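
For a version of the same pattern that runs without the course file, here is a small sketch with a filter as the preprocessing step. The in-memory text and the column names `beer` and `score` are hypothetical and are not claimed to match `beer_reviews.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large file: six rows, read two at a time
raw = io.StringIO("beer,score\nA,4.5\nB,2.0\nC,3.5\nD,5.0\nE,1.5\nF,4.0\n")

kept = []
for chunk in pd.read_csv(raw, chunksize=2):
    # Reduce each chunk before storing it: keep only rows with a score of 4.0 or more
    kept.append(chunk[chunk["score"] >= 4.0])

# ignore_index=True gives the combined frame a fresh 0..n-1 index
high_scores = pd.concat(kept, ignore_index=True)
```

Only the filtered rows are held in memory at the end, which is what keeps the approach workable for files larger than memory.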

#### From Database

As we learned in week 5, most enterprise data is stored in relational or other types of databases. Pandas provides the `read_sql` function to load data from most common databases. As when loading data in plain Python, a connection to the database must be created first. In the following example, a connection to a Postgres database is created with the `psycopg2` library, in the same way as last week. Once the connection object exists, the data can be read by passing the SQL query and the connection to `read_sql`. As above, the dataset can be read and preprocessed in chunks by passing the `chunksize` argument.

In [105]:
# Load a table from the Postgres database into a DataFrame
import psycopg2
conn = psycopg2.connect(host="localhost", database="t81577", user="postgres", password="postgres")
sql = "SELECT * FROM cities"
pd.read_sql(sql, conn).head()

|   | id | lat | lon | city | state |
|---|---|---|---|---|---|
| 0 | 0 | 41.0 | 80.0 | " ""Youngstown""" | OH |
| 1 | 1 | 42.0 | 97.0 | " ""Yankton""" | SD |
| 2 | 2 | 46.0 | 120.0 | " ""Yakima""" | WA |
| 3 | 3 | 42.0 | 71.0 | " ""Worcester""" | MA |
| 4 | 4 | 43.0 | 89.0 | " ""Wisconsin Dells""" | WI |
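
The `chunksize` option of `read_sql` can be tried without a database server. The sketch below swaps in an in-memory SQLite database from the Python standard library in place of Postgres; that substitution, the table contents, and the variable names are assumptions made only so the example is self-contained.

```python
import sqlite3
import pandas as pd

# Assumption: an in-memory SQLite database stands in for the Postgres server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (id INTEGER, lat REAL, city TEXT)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [(0, 41.0, "Youngstown"), (1, 42.0, "Yankton"), (2, 46.0, "Yakima")],
)

# With chunksize, read_sql returns an iterator of DataFrames instead of one frame
chunks = pd.read_sql("SELECT * FROM cities", conn, chunksize=2)
chunk_sizes = [len(chunk) for chunk in chunks]
conn.close()
```

Three rows read two at a time arrive as one chunk of two rows and one chunk of one row.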


### Indexing DataFrames

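Pandas offers two main indexers for extracting a subset of a DataFrame:

1. `loc`: selects rows and columns by label. Label slices include both endpoints.

2. `iloc`: selects rows and columns by integer position. Position slices exclude the end position, as in Python lists.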


In [106]:
df.iloc[0:2, 0:2]

|   | id | lat |
|---|---|---|
| 0 | 0 | 41 |
| 1 | 1 | 42 |


In [107]:
df.loc[0:2, ['id', 'lat']]

|   | id | lat |
|---|---|---|
| 0 | 0 | 41 |
| 1 | 1 | 42 |
| 2 | 2 | 46 |
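
With the default integer index, labels and positions look alike, which can hide the difference between the two indexers. A small sketch with string row labels makes it easier to see; the frame and its labels are hypothetical.

```python
import pandas as pd

# Hypothetical frame with string row labels
frame = pd.DataFrame(
    {"lat": [41, 42, 46], "lon": [80, 97, 120]},
    index=["a", "b", "c"],
)

by_position = frame.iloc[0:2]      # positions 0 and 1; the end position is excluded
by_label = frame.loc["a":"c"]      # labels "a" through "c"; the end label is included
one_value = frame.loc["b", "lon"]  # a single cell, by row label and column label
```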


### Selecting, adding, and deleting columns

A dataframe can be treated just like a dictionary of Series objects. Getting columns works with the same syntax as the analogous dict operations.


In [108]:
df["lat"].head(2)

0    41
1    42
Name: lat, dtype: int64

Alternatively, a single column can be accessed as an attribute of the DataFrame if its label is a valid Python identifier.

In [109]:
df.lon.head(2)

0    80
1    97
Name: lon, dtype: int64

To select more than one column, pass a list of column names.

In [110]:
df[["city", "lat", "lon"]].head(3)

|   | city | lat | lon |
|---|---|---|---|
| 0 | "Youngstown" | 41 | 80 |
| 1 | "Yankton" | 42 | 97 |
| 2 | "Yakima" | 46 | 120 |


In [111]:
del df["id"]
df.head()

|   | lat | lon | city | state |
|---|---|---|---|---|
| 0 | 41 | 80 | "Youngstown" | OH |
| 1 | 42 | 97 | "Yankton" | SD |
| 2 | 46 | 120 | "Yakima" | WA |
| 3 | 42 | 71 | "Worcester" | MA |
| 4 | 43 | 89 | "Wisconsin Dells" | WI |
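
The cell above deletes a column in place with `del`, again following dict syntax. Adding a column works the same way, by assigning to a new key. The sketch below shows that, together with `drop` as a way to remove a column without modifying the original frame. The frame and the derived column name are hypothetical.

```python
import pandas as pd

cities = pd.DataFrame({"lat": [41, 42], "lon": [80, 97], "state": ["OH", "SD"]})

# Add a column by assigning to a new key, as with a dict
cities["is_north_of_41"] = cities["lat"] > 41

# drop returns a new frame without the column and leaves `cities` unchanged
without_state = cities.drop(columns=["state"])
```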


### Getting array values

The values of a DataFrame or Series can be retrieved as a NumPy array through the `values` attribute.


In [112]:
df.iloc[0:5,:].values

array([[41, 80, ' "Youngstown"', ' OH'],
       [42, 97, ' "Yankton"', ' SD'],
       [46, 120, ' "Yakima"', ' WA'],
       [42, 71, ' "Worcester"', ' MA'],
       [43, 89, ' "Wisconsin Dells"', ' WI']], dtype=object)

In [113]:
df['lon'].iloc[0:5].values

array([ 80,  97, 120,  71,  89])
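
Pandas also offers the `to_numpy()` method for the same purpose. The sketch below uses it on a hypothetical frame to show why the first output above has `dtype=object` while the second is a plain integer array.

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column
frame = pd.DataFrame({"lat": [41, 42], "state": ["OH", "SD"]})

# A single numeric column keeps its numeric dtype
lat_array = frame["lat"].to_numpy()

# Mixed columns need one common dtype, so the result is an object array
mixed_array = frame.to_numpy()
```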

### Writing data
A DataFrame can be saved in various file formats using methods such as `to_csv` for CSV files.

In [None]:
df.to_csv(OUTPUT_FILE_NAME)
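
A minimal round-trip sketch is shown below. It writes to an in-memory buffer instead of a file so that it runs anywhere; the frame is hypothetical, and `index=False` is one possible choice, used here to keep the row index out of the output.

```python
import io
import pandas as pd

frame = pd.DataFrame({"lat": [41, 42], "state": ["OH", "SD"]})

# Write to an in-memory text buffer; a file path can be passed in the same place
buffer = io.StringIO()
frame.to_csv(buffer, index=False)  # index=False leaves the row index out
csv_text = buffer.getvalue()

# Reading the text back gives the same columns, with no extra index column
round_trip = pd.read_csv(io.StringIO(csv_text))
```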

In [None]:
# Bin the values into three equal-width intervals and label each bin
import numpy as np
pd.cut(np.array([1, 7, 5, 4, 6, 3]),
       3, labels=["bad", "medium", "good"])