# Pandas

Python has long been great for data munging and preparation, but less so for data analysis and modeling. Pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language like *R*. Pandas contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python. It is built on top of NumPy and is designed to work smoothly in NumPy-centric code. As a result, pandas uses many of the same notations and conventions, and follows the same standards.

There are two main data structures in pandas: **Series** and **DataFrame**. We will learn about them one by one.

In [120]:
import pandas as pd  # to work with pandas, you first have to import it
import numpy as np  # in case we need it

## Pandas Series

A **Pandas series** is a *one-dimensional* array-like object containing an array of data and an associated array of data labels, called its **index**.

### From list

In [121]:
colors = pd.Series(['Red', 'Blue', 'Green'])
colors

0      Red
1     Blue
2    Green
dtype: object

The string representation of a pandas series shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created. You can get the array representation and the index object of a pandas series via its `values` and `index` attributes, respectively.

```{note}

The **object** data type is the one data type that is unlike the others. A column that is of *object* data type may contain values that are of any valid Python object. Typically, when a column is of the *object* data type, it signals that the entire column is **strings**. This isn't necessarily the case as it is possible for these columns to contain a mixture of *integers*, *booleans*, *strings*, or other, even more complex Python objects such as lists or dictionaries. The object data type is a **catch-all for columns that pandas doesn’t recognize as any other specific type**.
```
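As a small illustration of the note above, the following sketch builds a series that mixes several Python types and checks that pandas falls back to the *object* data type. The variable names are made up for this example. Depending on the pandas version, a series containing only strings may be reported with a dedicated string data type instead of *object*, so the sketch relies only on the mixed case.

```python
import pandas as pd

# A series holding an int, a str, a float and a list (hypothetical data)
mixed = pd.Series([1, 'two', 3.0, [4, 5]])

# pandas cannot pick one specific type, so it uses the catch-all object dtype
print(mixed.dtype)

# Each element keeps its original Python type
element_types = [type(value).__name__ for value in mixed]
print(element_types)

# For comparison, a series of plain integers gets an integer dtype
numbers = pd.Series([1, 2, 3])
print(numbers.dtype)
```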

You can also use `.values` to get a NumPy array containing the values of the series.

In [122]:
colors.values

array(['Red', 'Blue', 'Green'], dtype=object)

You can also use `.index` to get the index of the series. By default, it is a `RangeIndex`.

In [123]:
colors.index # similar to range() function

RangeIndex(start=0, stop=3, step=1)

Often it will be desirable to create a pandas series with an index identifying each data point: 

In [124]:
fruits = pd.Series(['Red', 'Yellow', 'Orange', 'Green'], index=['Apple', 'Mango', 'Orange', 'Grapes']) 
fruits

Apple        Red
Mango     Yellow
Orange    Orange
Grapes     Green
dtype: object

Unlike plain NumPy arrays, which are indexed by position, a pandas series lets you use the *values in the index* (the labels) when selecting single values or a set of values:

In [125]:
fruits['Apple']

'Red'

In [126]:
fruits['Grapes'] = 'Black'

In [127]:
fruits[['Mango', 'Orange', 'Grapes']] # passing a list of index labels

Mango     Yellow
Orange    Orange
Grapes     Black
dtype: object

NumPy array operations, such as *filtering with a boolean array*, *scalar multiplication*, or applying math functions, will preserve the index-value link.
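A minimal sketch of what "preserving the index-value link" means, using a small made-up series of tree heights (the data and names are assumptions for illustration only). After each operation, the labels stay attached to the values they belong to.

```python
import numpy as np
import pandas as pd

# Hypothetical data: heights in metres, labelled by plant
heights = pd.Series([1.5, 2.0, 0.5], index=['oak', 'pine', 'shrub'])

# Filtering with a boolean array keeps the labels of the surviving rows
tall = heights[heights > 1.0]

# Scalar multiplication is applied element-wise; labels are unchanged
doubled = heights * 2

# NumPy math functions also return a series with the same index
roots = np.sqrt(heights)

print(tall)
print(doubled)
print(roots)
```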

```{admonition} Exercise 
Try applying the above operations to a pandas series with integer values. Here's one to get you started.
```

```{code-block} python

pd.Series([10,20,30, 40, 50], index=['ten', 'twenty', 'thirty', 'forty', 'fifty'])

```

In [128]:
## filter values greater than 25

In [129]:
## Multiply all the values by 3

In [130]:
## Take squares, log & exp of each element

### From Dictionary

Another way to think about a pandas series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be substituted into many functions that expect a dict:

In [131]:
'kiwi' in fruits

False

In [132]:
'Apple' in fruits

True

You can create a pandas series from a Python dictionary:

In [133]:
fruits = {'Apple': 'Red',
          'Mango': 'Yellow',
          'Orange': 'Orange',
          'Grapes': 'Green'}
fruits = pd.Series(fruits)
fruits

Apple        Red
Mango     Yellow
Orange    Orange
Grapes     Green
dtype: object

In [134]:
fruits.index

Index(['Apple', 'Mango', 'Orange', 'Grapes'], dtype='object')

You can achieve the same by passing the values and index separately, as we did earlier.

In [135]:
fruits = pd.Series(['Red', 'Yellow', 'Orange', 'Green'], 
              index=['Apple', 'Mango', 'Orange', 'Grapes'])
fruits

Apple        Red
Mango     Yellow
Orange    Orange
Grapes     Green
dtype: object

A critical pandas series feature for many applications is that it automatically aligns differently indexed data in arithmetic operations.

In [136]:
maths = pd.Series([10, 12, 8, 4, 18], index=['Ram', 'Sham', 'Ganesh', 'Ramesh', 'Raju'])
maths

Ram       10
Sham      12
Ganesh     8
Ramesh     4
Raju      18
dtype: int64

In [137]:
english = pd.Series([9, 13, 10, 18, 8], index=['Ram', 'Sham', 'Ganesh', 'Ramesh', 'Rakesh'])
english

Ram        9
Sham      13
Ganesh    10
Ramesh    18
Rakesh     8
dtype: int64

In [138]:
total = maths + english
total

Ganesh    18.0
Raju       NaN
Rakesh     NaN
Ram       19.0
Ramesh    22.0
Sham      25.0
dtype: float64

Labels that appear in only one of the two series get the value `NaN` (not a number), which pandas uses to mark missing or NA values. The `isnull` and `notnull` functions in pandas should be used to detect missing data:

In [139]:
total.isnull()

Ganesh    False
Raju       True
Rakesh     True
Ram       False
Ramesh    False
Sham      False
dtype: bool

In [140]:
total.notnull()

Ganesh     True
Raju      False
Rakesh    False
Ram        True
Ramesh     True
Sham       True
dtype: bool

You can also use `pd.isnull(total)` and `pd.notnull(total)` to achieve the same result.
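The sketch below repeats the alignment example on a smaller, made-up pair of series and checks that the function form and the method form of `isnull` agree. It also shows one possible way to avoid the `NaN` values, `Series.add` with `fill_value=0`, which is not covered in the text above and is included only as an option to explore.

```python
import pandas as pd

# Two series whose indexes only partly overlap (shortened, hypothetical data)
maths = pd.Series([10, 12, 18], index=['Ram', 'Sham', 'Raju'])
english = pd.Series([9, 13, 8], index=['Ram', 'Sham', 'Rakesh'])

# Addition aligns on labels; labels present on one side only become NaN
total = maths + english

# Count the missing entries and confirm both isnull spellings agree
missing_count = int(total.isnull().sum())
same_answer = pd.isnull(total).equals(total.isnull())

# Treat a missing mark as 0 instead of producing NaN
filled = maths.add(english, fill_value=0)

print(total)
print(missing_count, same_answer)
print(filled)
```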

## Pandas DataFrame

A **DataFrame** represents a tabular, spreadsheet-like data structure containing an *ordered collection of columns*, each of which can be a different value type (numeric, string, boolean, etc.). It is the central data structure in pandas, and you can apply all kinds of operations to it. The DataFrame has both a *row* and a *column* index; it can be thought of as a dictionary of pandas series, all sharing the same index. Row-oriented and column-oriented operations on a DataFrame are treated roughly symmetrically.

There are numerous ways to construct a DataFrame, though one of the **most common is from a dictionary of equal-length lists or NumPy arrays**.

### From dictionary

In [141]:
marks = {'Maths': [15, 17, 6, 14, 19],        
        'English': [20, 20, 12, 13, 18],        
        'Science': [15, 7, 16, 20, 9]} 
marks = pd.DataFrame(marks) 
marks

   Maths  English  Science
0     15       20       15
1     17       20        7
2      6       12       16
3     14       13       20
4     19       18        9


The resulting DataFrame will have its index assigned automatically as with pandas series, and the columns are placed in the same order as in the dictionary.

If you specify a sequence of columns, the DataFrame's columns will be exactly what you pass, in that order:

In [142]:
marks = pd.DataFrame(marks, columns=['English', 'Maths', 'Science'])
marks

   English  Maths  Science
0       20     15       15
1       20     17        7
2       12      6       16
3       13     14       20
4       18     19        9


As with pandas series, if you pass a column that isn’t contained in data, it will appear with `NaN` values in the result: 

In [143]:
marks = pd.DataFrame(marks, columns=['English', 'Maths', 'Science', 'History'])
marks

   English  Maths  Science  History
0       20     15       15      NaN
1       20     17        7      NaN
2       12      6       16      NaN
3       13     14       20      NaN
4       18     19        9      NaN


A column in a DataFrame can be retrieved as a pandas series either by dict-like notation or by attribute: 

In [144]:
marks['Maths'] # dict-like notation

0    15
1    17
2     6
3    14
4    19
Name: Maths, dtype: int64

In [145]:
marks.Maths # attribute like notation

0    15
1    17
2     6
3    14
4    19
Name: Maths, dtype: int64

```{note} 
The returned pandas series have the *same index* as the DataFrame, and their *name* attribute has been appropriately set. 
```

Columns can be modified by assigning a scalar value or an array of values: 

In [146]:
marks['History'] = 15
marks

   English  Maths  Science  History
0       20     15       15       15
1       20     17        7       15
2       12      6       16       15
3       13     14       20       15
4       18     19        9       15


In [147]:
marks['History'] = np.random.randn(5)*10
marks

   English  Maths  Science    History
0       20     15       15   4.746769
1       20     17        7  -3.843704
2       12      6       16 -16.124691
3       13     14       20  15.110036
4       18     19        9   2.528761


When assigning *lists* or *arrays* to a column, the value's length must match the length of the *DataFrame*. If you assign a *pandas series*, it will instead be conformed exactly to the DataFrame's index: values are matched by label, and `NaN` is inserted for any index label the series does not contain.

In [148]:
chemistry = pd.Series([10, 12, 13, 15, 9], index=[1,2,3,4,5])
chemistry

1    10
2    12
3    13
4    15
5     9
dtype: int64

Assigning a column that doesn't exist will create a new column. Let's create a *Chemistry* column.

In [149]:
marks['Chemistry'] = chemistry
marks

   English  Maths  Science    History  Chemistry
0       20     15       15   4.746769        NaN
1       20     17        7  -3.843704       10.0
2       12      6       16 -16.124691       12.0
3       13     14       20  15.110036       13.0
4       18     19        9   2.528761       15.0
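The two assignment rules described above can be compared side by side. This is a minimal sketch on a made-up three-row DataFrame (the column names `a`, `b` and `c` are placeholders): a list of the wrong length is rejected with a `ValueError`, while a series is aligned on the index and padded with `NaN`.

```python
import pandas as pd

# Hypothetical three-row DataFrame with the default index 0, 1, 2
df = pd.DataFrame({'a': [1, 2, 3]})

# A list must have exactly as many items as the DataFrame has rows
length_error = None
try:
    df['b'] = [10, 20]  # only two values for three rows
except ValueError as err:
    length_error = str(err)

# A series is matched by label instead; row 0 has no match and becomes NaN
df['c'] = pd.Series([10, 20], index=[1, 2])

print(length_error)
print(df)
```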


```{warning} 
Pay extra attention to *indexes* when working with *DataFrames* and *Series*. Almost every operation is index dependent (i.e. operations are aligned along the indexes).
```
```{note} 
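In older pandas versions, the column returned when indexing a DataFrame is typically a *view* on the underlying data, not a *copy*, so in-place modifications to the pandas series can be reflected in the DataFrame. With Copy-on-Write (the default behaviour from pandas 3.0 onward), modifying the returned series no longer changes the DataFrame. If you need an independent column regardless of version, copy it explicitly using the series's `copy()` method.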
```
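A short sketch of the explicit copy mentioned in the note, on a made-up single-column DataFrame. Whether an uncopied column behaves as a view depends on the pandas version, so the example only demonstrates the part that should hold everywhere: changes to a copy do not reach the DataFrame.

```python
import pandas as pd

# Hypothetical marks table
marks = pd.DataFrame({'Maths': [15, 17, 6]})

# Take an explicit, independent copy of the column
maths_copy = marks['Maths'].copy()

# Modify the copy only
maths_copy.loc[0] = 100

print(maths_copy)
print(marks)  # the DataFrame still holds the original value 15
```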

Another common way of creating DataFrames is from a NumPy `ndarray`.

In [150]:
random_data = np.random.randint(10,100,25).reshape(5,5)

In [151]:
random_data

array([[23, 15, 37, 79, 93],
       [22, 59, 25, 88, 56],
       [43, 52, 76, 74, 57],
       [65, 11, 65, 22, 40],
       [83, 69, 87, 81, 97]])

In [152]:
list('ABCDE')

['A', 'B', 'C', 'D', 'E']

In [153]:
pd.DataFrame(random_data, index=['monday', 'tuesday', 'wednesday', 'thursday', 'friday'], columns=list('ABCDE'))

            A   B   C   D   E
monday     23  15  37  79  93
tuesday    22  59  25  88  56
wednesday  43  52  76  74  57
thursday   65  11  65  22  40
friday     83  69  87  81  97


You can also create a DataFrame from **dict-of-dict** or **dict-of-series** data structures. The complete list of things that you can pass to the DataFrame constructor can be found in the pandas documentation. We recommend that you read it.
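As a rough sketch of the two structures just mentioned, the example below builds one DataFrame from a dict of series and one from a dict of dicts. The student names and marks are made up. In both cases the outer keys become columns; with a dict of series, the row index is the union of the series' indexes, and missing combinations are filled with `NaN`.

```python
import pandas as pd

# dict-of-series: the two series share only the label 'Sham'
by_series = pd.DataFrame({
    'Maths': pd.Series([15, 17], index=['Ram', 'Sham']),
    'English': pd.Series([20, 12], index=['Sham', 'Raju']),
})

# dict-of-dict: outer keys are columns, inner keys are row labels
by_dict = pd.DataFrame({
    'Maths': {'Ram': 15, 'Sham': 17},
    'English': {'Ram': 20, 'Sham': 12},
})

# Number of cells that had no value in the dict-of-series input
missing_cells = int(by_series.isnull().sum().sum())

print(by_series)
print(by_dict)
print(missing_cells)
```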

## Reading dataset
Most of the time, we load our data from an external source such as a *.csv file*, an *SQL table*, an *Excel* workbook, a network location, or an *XML file*. Pandas covers these cases with a number of functions for reading tabular data as a DataFrame object.

```{note}
All the datasets required in this course are provided in a zip file on **moodle**. Your path to the datasets folder might not be the same as ours, so before you start reading datasets, recall the *file path concept* from Python's *file* chapter.
```

```{note}
`read_csv()` - Load delimited data from a file, URL, or file-like object. Uses a comma as the default delimiter.
```

Let's first have a look at our *.csv* file using the UNIX *cat* command.

In [154]:
!cat ../data/sample_data.csv

,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


In [155]:
df = pd.read_csv("../data/sample_data.csv") # reading a csv file
df

  Unnamed: 0 state  color    food  age  height  score
0       Jane    NY   blue   Steak   30     165    4.6
1       Niko    TX  green    Lamb    2      70    8.3
2      Aaron    FL    red   Mango   12     120    9.0
3   Penelope    AL  white   Apple    4      80    3.3
4       Dean    AK   gray  Cheese   32     180    1.8
5  Christina    TX  black   Melon   33     172    9.5
6   Cornelia    TX    red   Beans   69     150    2.2
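The `Unnamed: 0` column appears because the header row of the file starts with an empty field. One possible way to handle this, sketched below, is the `index_col` parameter of `read_csv`, which uses a column as the row index. To keep the sketch runnable without the course files, it reads an inline copy of the first three rows through `io.StringIO` instead of `../data/sample_data.csv`.

```python
import io
import pandas as pd

# Inline copy of the first rows of the sample file (stands in for the real path)
csv_text = """,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
"""

# Default behaviour: the unnamed first column is kept as a regular column
default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column (the names) as the row index instead
named = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default.columns.tolist())
print(named)
```

With the names as the index, a row can be selected by label, for example `named.loc['Niko']`.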


We can also use `read_table` and pass the delimiter through the `sep` argument. Our file is a **CSV** (comma-separated values) file, so we set `sep=','`.

In [156]:
pd.read_table('../data/sample_data.csv', sep=',')

  Unnamed: 0 state  color    food  age  height  score
0       Jane    NY   blue   Steak   30     165    4.6
1       Niko    TX  green    Lamb    2      70    8.3
2      Aaron    FL    red   Mango   12     120    9.0
3   Penelope    AL  white   Apple    4      80    3.3
4       Dean    AK   gray  Cheese   32     180    1.8
5  Christina    TX  black   Melon   33     172    9.5
6   Cornelia    TX    red   Beans   69     150    2.2
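For comparison, `read_table` uses a tab as its default delimiter, so it can read tab-separated text without any `sep` argument. The sketch below uses a small made-up tab-separated string (read through `io.StringIO` so that no file is needed) and checks that `read_table` and `read_csv` with `sep='\t'` give the same DataFrame.

```python
import io
import pandas as pd

# Hypothetical tab-separated data
tsv_text = "name\tstate\tage\nJane\tNY\t30\nNiko\tTX\t2\n"

# read_table splits on tabs by default
from_table = pd.read_table(io.StringIO(tsv_text))

# read_csv gives the same result once the separator is set to a tab
from_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(from_table)
print(from_table.equals(from_csv))
```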


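Passing `header=None` tells pandas that the file has no header row, so the first line is read as data and the columns are numbered from 0:

In [157]:
pd.read_csv('../data/sample_data.csv', header=None)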

           0      1      2       3    4       5      6
0        NaN  state  color    food  age  height  score
1       Jane     NY   blue   Steak   30     165    4.6
2       Niko     TX  green    Lamb    2      70    8.3
3      Aaron     FL    red   Mango   12     120    9.0
4   Penelope     AL  white   Apple    4      80    3.3
5       Dean     AK   gray  Cheese   32     180    1.8
6  Christina     TX  black   Melon   33     172    9.5
7   Cornelia     TX    red   Beans   69     150    2.2


Pandas is not limited to reading CSV files. Here is the list of reader functions available in the pandas version used here (the exact list varies between versions):

In [158]:
pd.read_*?

pd.read_clipboard
pd.read_csv
pd.read_excel
pd.read_feather
pd.read_fwf
pd.read_gbq
pd.read_hdf
pd.read_html
pd.read_json
pd.read_msgpack
pd.read_parquet
pd.read_pickle
pd.read_sas
pd.read_sql
pd.read_sql_query
pd.read_sql_table
pd.read_stata
pd.read_table

## Conclusion

### Questionnaire
1. How do you create a pandas series from a Python list, and how do you assign index values to it?
2. Compare a pandas series with a Python list: what are the similarities and differences?
3. What types of data sources can pandas read?

### Further Reading
Pandas can read data from a wide range of sources and formats. To learn more, you can refer to [this introduction to pandas data structures](http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures).