# Introduction to pandas:
[*pandas*](http://pandas.pydata.org/) is a column-oriented data analysis API.
It's a great tool for handling and analyzing input data, and many ML frameworks
support *pandas* data structures as inputs.  

Although a comprehensive
introduction to the API would span many pages, the core concepts are fairly
straightforward, and we'll present them below. For a more complete reference,
the [*pandas* docs site](http://pandas.pydata.org/pandas-docs/stable/index.html)
contains extensive documentation and many tutorials. (Note that Colab may use a
slightly older version number, but the parts of *pandas* covered here are
unlikely to differ from version to version.)

In [14]:
import pandas as pd
pd.__version__

'2.2.2'

# Series and DataFrame:
The primary data structures in *pandas* are implemented as two classes:
* **`Series`**, which is a single column. Each row can be labeled via an index.
* **`DataFrame`**, which you can imagine as a relational data table, with rows and named columns. A `DataFrame` contains one or more `Series` and a name for each `Series`.

The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in Spark and R.

### Series:
Think of a `Series` as:
* A single column of data
* Like a list, but with superpowers:
    * It has values
    * It has labels (called an index)

In [15]:
cities = pd.Series(["chennai", "mumbai", "kolkata", "delhi"])
cities


Unnamed: 0,0
0,chennai
1,mumbai
2,kolkata
3,delhi


In [16]:
type(cities)

We can also label the entries ourselves by passing a dictionary, whose keys become the index:

In [17]:
cities = pd.Series({"south":"Chennai", "west":"Mumbai", "east":"Kolkata", "north":"New Delhi"})

cities

Unnamed: 0,0
south,Chennai
west,Mumbai
east,Kolkata
north,New Delhi
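
Because the index labels are stored alongside the values, a labeled `Series` can be queried much like a dictionary. The following sketch rebuilds the `cities` Series from the cell above and looks up entries by label; the variable names `labels`, `values`, `west_city`, and `subset` are illustrative only.

```python
import pandas as pd

cities = pd.Series({"south": "Chennai", "west": "Mumbai",
                    "east": "Kolkata", "north": "New Delhi"})

labels = list(cities.index)     # the index holds the labels
values = list(cities.values)    # the underlying values, in the same order
west_city = cities["west"]      # label-based lookup, like a dict
subset = cities.loc[["north", "south"]]  # select several labels at once

print(labels)
print(west_city)
print(subset)
```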


### DataFrame:
* A `DataFrame` is a set of `Series` placed side by side, each with a column name.
* Its rows are labeled by an index that all the columns share.
* Its columns have names and are actually `Series` under the hood.



In [18]:
cities = pd.Series(["chennai", "mumbai", "kolkata", "delhi"])
population = pd.Series([700000, 1700000, 800000])

city_info_df = pd.DataFrame({"cities": cities, "population": population})
city_info_df

Unnamed: 0,cities,population
0,chennai,700000.0
1,mumbai,1700000.0
2,kolkata,800000.0
3,delhi,
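
Note that `population` has only three values while `cities` has four. pandas aligns the two Series on their index labels, so the row with no population value is filled with `NaN`, and the integer column is converted to `float64` to hold it. A minimal sketch of the same behavior, reusing the values from the cell above (the name `missing` is illustrative):

```python
import pandas as pd

cities = pd.Series(["chennai", "mumbai", "kolkata", "delhi"])
population = pd.Series([700000, 1700000, 800000])  # one value fewer

city_info_df = pd.DataFrame({"cities": cities, "population": population})

# Index label 3 has no population entry, so pandas inserts NaN there.
missing = city_info_df["population"].isna()
print(missing.tolist())                   # which rows lack a population
print(population.dtype)                   # integer dtype before alignment
print(city_info_df["population"].dtype)   # float64 once NaN is introduced
```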


In [19]:
type(city_info_df)
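
To see that each column is a `Series` under the hood, select one by name. This sketch uses a small frame built from the first three cities only, so there is no missing data; `pop_col` is just an illustrative name.

```python
import pandas as pd

city_info_df = pd.DataFrame({
    "cities": ["chennai", "mumbai", "kolkata"],
    "population": [700000, 1700000, 800000],
})

pop_col = city_info_df["population"]   # selecting one column by name
print(type(city_info_df))              # the table itself is a DataFrame
print(type(pop_col))                   # a single column is a Series
print(pop_col.name)                    # the Series keeps its column name
```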

# Exploring data in DataFrame:

In [26]:
from sklearn.datasets import load_diabetes


?load_diabetes

In [27]:
diabetes = load_diabetes(as_frame=True)
print(diabetes.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

:Number of Instances: 442

:Number of Attributes: First 10 columns are numeric predictive values

:Target: Column 11 is a quantitative measure of disease progression one year after baseline

:Attribute Information:
    - age     age in years
    - sex
    - bmi     body mass index
    - bp      average blood pressure
    - s1      tc, total serum cholesterol
    - s2      ldl, low-density lipoproteins
    - s3      hdl, high-density lipoproteins
    - s4      tch, total cholesterol / HDL
    - s5      ltg, possibly log of serum triglycerides level
    - s6      glu, blood sugar level

Note: Each of these 10 feature variables have bee

* `load_diabetes` returns a **Bunch** object (a dictionary whose keys can also be read as attributes), which contains fields such as:
```python
dict_keys([
    'data',            # feature data of shape (442, 10): a DataFrame when as_frame=True, otherwise a NumPy array
    'target',          # continuous target values of shape (442,): a Series when as_frame=True, otherwise a NumPy array
    'frame',           # pandas DataFrame, optional (available if return_X_y=False and as_frame=True)
    'feature_names',   # list of 10 feature names
    'DESCR',           # long string description of the dataset
    'data_filename',   # path to .csv file with data (for legacy purposes)
    'target_filename'  # path to .csv file with target values (for legacy purposes)
])
```

In [23]:
type(diabetes)

In [29]:
diabetes.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

In [46]:
diabetes.data[:5]

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641


In [45]:
df = diabetes["data"]
df.shape

(442, 10)

The `shape` attribute shows that the DataFrame has 442 rows and 10 columns.

In [47]:
df.columns

Index(['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], dtype='object')

`df.columns` returns a pandas `Index`, which behaves much like a list but is not a built-in Python list. It can be converted to one with `list()`:

In [48]:
list(df.columns)

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
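
A small sketch of how a pandas `Index` differs from a plain list: it supports list-like reads but rejects item assignment. The frame `df_small` is a hypothetical three-column stand-in for `df`, with made-up values.

```python
import pandas as pd

df_small = pd.DataFrame({"age": [0.04], "sex": [0.05], "bmi": [0.06]})

cols = df_small.columns          # a pandas Index, not a Python list
first = cols[0]                  # positional reads work like a list
as_list = cols.tolist()          # same result as list(cols)

try:
    cols[0] = "years"            # an Index is immutable
    assignment_failed = False
except TypeError:
    assignment_failed = True

print(type(cols).__name__, first, as_list, assignment_failed)
```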

In [49]:
df.info()
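
`info` is a method, so it needs parentheses to run; `df.info()` prints the column names, non-null counts, dtypes, and memory usage. As a sketch, the example below calls it on a small hypothetical frame and captures the printed text through the `buf` parameter so it can be inspected as a string.

```python
import io
import pandas as pd

df_small = pd.DataFrame({
    "age": [0.038, -0.002, 0.085],
    "bmi": [0.062, None, 0.044],   # one missing value on purpose
})

buffer = io.StringIO()
df_small.info(buf=buffer)        # write the summary into the buffer
summary = buffer.getvalue()
print(summary)
```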

In [52]:
# First five entries
df.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641


In [55]:
# Last ten entries
df.tail(10)

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
432,0.009016,-0.044642,0.055229,-0.00567,0.057597,0.044719,-0.002903,0.023239,0.055686,0.106617
433,-0.02731,-0.044642,-0.060097,-0.02977,0.046589,0.01998,0.122273,-0.039493,-0.051404,-0.009362
434,0.016281,-0.044642,0.001339,0.008101,0.005311,0.010899,0.030232,-0.039493,-0.045424,0.032059
435,-0.01278,-0.044642,-0.023451,-0.040099,-0.016704,0.004636,-0.017629,-0.002592,-0.03846,-0.038357
436,-0.05637,-0.044642,-0.074108,-0.050427,-0.02496,-0.047034,0.09282,-0.076395,-0.061176,-0.046641
437,0.041708,0.05068,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207
438,-0.005515,0.05068,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485
439,0.041708,0.05068,-0.015906,0.017293,-0.037344,-0.01384,-0.024993,-0.01108,-0.046883,0.015491
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.02656,0.044529,-0.02593
441,-0.045472,-0.044642,-0.07303,-0.081413,0.08374,0.027809,0.173816,-0.039493,-0.004222,0.003064


In [57]:
df.describe()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
count,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0
mean,-2.511817e-19,1.23079e-17,-2.245564e-16,-4.79757e-17,-1.3814990000000001e-17,3.9184340000000004e-17,-5.777179e-18,-9.04254e-18,9.293722000000001e-17,1.130318e-17
std,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905
min,-0.1072256,-0.04464164,-0.0902753,-0.1123988,-0.1267807,-0.1156131,-0.1023071,-0.0763945,-0.1260971,-0.1377672
25%,-0.03729927,-0.04464164,-0.03422907,-0.03665608,-0.03424784,-0.0303584,-0.03511716,-0.03949338,-0.03324559,-0.03317903
50%,0.00538306,-0.04464164,-0.007283766,-0.005670422,-0.004320866,-0.003819065,-0.006584468,-0.002592262,-0.001947171,-0.001077698
75%,0.03807591,0.05068012,0.03124802,0.03564379,0.02835801,0.02984439,0.0293115,0.03430886,0.03243232,0.02791705
max,0.1107267,0.05068012,0.1705552,0.1320436,0.1539137,0.198788,0.1811791,0.1852344,0.1335973,0.1356118
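
The summary above shows a mean of essentially zero and the same standard deviation, 0.047619, for every column. According to the scikit-learn description of this dataset, each feature has been mean-centered and scaled so that its sum of squares equals 1; with 442 rows the sample standard deviation is then 1/sqrt(441) = 1/21 ≈ 0.047619. The sketch below reproduces that arithmetic on synthetic random data (not the diabetes values); the seed and variable names are arbitrary.

```python
import numpy as np
import pandas as pd

n = 442
rng = np.random.default_rng(0)
raw = rng.normal(size=n)               # synthetic stand-in column

centered = raw - raw.mean()            # mean-center
scaled = centered / np.sqrt((centered ** 2).sum())  # sum of squares becomes 1

col = pd.Series(scaled)
print(col.mean())                      # effectively zero (floating-point noise)
print(col.std())                       # sample standard deviation (ddof=1)
print(1 / np.sqrt(n - 1))              # 1/21
```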


In [59]:
df.describe().T  # transpose so that each feature becomes a row

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,442.0,-2.511817e-19,0.047619,-0.107226,-0.037299,0.005383,0.038076,0.110727
sex,442.0,1.23079e-17,0.047619,-0.044642,-0.044642,-0.044642,0.05068,0.05068
bmi,442.0,-2.245564e-16,0.047619,-0.090275,-0.034229,-0.007284,0.031248,0.170555
bp,442.0,-4.79757e-17,0.047619,-0.112399,-0.036656,-0.00567,0.035644,0.132044
s1,442.0,-1.3814990000000001e-17,0.047619,-0.126781,-0.034248,-0.004321,0.028358,0.153914
s2,442.0,3.9184340000000004e-17,0.047619,-0.115613,-0.030358,-0.003819,0.029844,0.198788
s3,442.0,-5.777179e-18,0.047619,-0.102307,-0.035117,-0.006584,0.029312,0.181179
s4,442.0,-9.04254e-18,0.047619,-0.076395,-0.039493,-0.002592,0.034309,0.185234
s5,442.0,9.293722000000001e-17,0.047619,-0.126097,-0.033246,-0.001947,0.032432,0.133597
s6,442.0,1.130318e-17,0.047619,-0.137767,-0.033179,-0.001078,0.027917,0.135612


In [58]:
df.describe(percentiles=[0.2, 0.4, 0.6])

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
count,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0
mean,-2.511817e-19,1.23079e-17,-2.245564e-16,-4.79757e-17,-1.3814990000000001e-17,3.9184340000000004e-17,-5.777179e-18,-9.04254e-18,9.293722000000001e-17,1.130318e-17
std,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905
min,-0.1072256,-0.04464164,-0.0902753,-0.1123988,-0.1267807,-0.1156131,-0.1023071,-0.0763945,-0.1260971,-0.1377672
20%,-0.04547248,-0.04464164,-0.04048038,-0.04009893,-0.03871969,-0.03695017,-0.03971921,-0.03949338,-0.04117617,-0.03835666
40%,-0.005514555,-0.04464164,-0.01806189,-0.01944183,-0.0120262,-0.01559345,-0.01762938,-0.007684617,-0.01811369,-0.01184718
50%,0.00538306,-0.04464164,-0.007283766,-0.005670422,-0.004320866,-0.003819065,-0.006584468,-0.002592262,-0.001947171,-0.001077698
60%,0.01628068,0.05068012,0.005218854,0.008100982,0.00806271,0.008706873,0.008142084,-0.002592262,0.01255119,0.007206516
max,0.1107267,0.05068012,0.1705552,0.1320436,0.1539137,0.198788,0.1811791,0.1852344,0.1335973,0.1356118
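
The `percentiles` argument takes fractions between 0 and 1 and replaces the default quartile rows. A brief sketch on a hypothetical single-column frame holding the integers 0 to 100, where each percentile is easy to verify by eye:

```python
import pandas as pd

df_small = pd.DataFrame({"x": range(101)})   # values 0, 1, ..., 100

summary = df_small.describe(percentiles=[0.2, 0.4, 0.6])
print(summary)

p20 = summary.loc["20%", "x"]   # row labels are formatted as percentages
p60 = summary.loc["60%", "x"]
```

In the pandas 2.2.2 output above, the `50%` row appears even though it was not requested; this behavior may differ in other pandas versions.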
