## Pandas 

### Series

1. How to Create a Series?


In [2]:
import pandas as pd 

A Series is a one-dimensional labeled array, similar to a list or a single column in a table. Every item in the Series gets an index label. By default, the labels are the integers 0 through N - 1, where N is the length of the Series.

In [3]:
# Pass a list 
series = pd.Series([12,7.5,"Ashutosh","Delhi"])
series

0          12
1         7.5
2    Ashutosh
3       Delhi
dtype: object

In [5]:
# pass an index 
series = pd.Series([12, 7.5, "Ashutosh","Delhi"], index=["10th marks","12th marks","name","place"])
series

10th marks          12
12th marks         7.5
name          Ashutosh
place            Delhi
dtype: object

In [6]:
# pass a dictionary
new_Series = pd.Series({"name":"Ashutosh", "age":29,"Place" : "Delhi"})
new_Series

Place       Delhi
age            29
name     Ashutosh
dtype: object

Accessing Elements in a Series

1. Using the index 

In [7]:
new_Series["name"]

'Ashutosh'

In [8]:
new_Series["age"]

29

In [7]:
new_Series.index

Index(['Place', 'age', 'name'], dtype='object')

In [9]:
# We can also use the position; .iloc makes positional access explicit
new_Series.iloc[0]

'Delhi'

In [10]:
new_Series.iloc[2]

'Ashutosh'
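
Plain brackets on a Series accept both labels and positions, which can be ambiguous. A minimal sketch of the explicit accessors `.loc` (label) and `.iloc` (position), assuming a small Series called `info` (a hypothetical name used only for this illustration):

```python
import pandas as pd

# Hypothetical Series for illustration, built the same way as new_Series
info = pd.Series({"name": "Ashutosh", "age": 29, "Place": "Delhi"})

by_label = info.loc["age"]    # label-based lookup
by_position = info.iloc[0]    # position-based lookup: whatever entry comes first

print(by_label)
print(by_position)
```

Which entry comes first depends on how the pandas version orders dictionary keys, so the positional result is deliberately not hard-coded here.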

In [10]:
# name the index and the Series itself
new_Series.index.name = "Info"
new_Series.name = "Details"
new_Series

Info
Place       Delhi
age            29
name     Ashutosh
Name: Details, dtype: object

In [13]:
# Build a Series from a dictionary again
bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115}
pd.Series(bacteria_dict)

Actinobacteria     569
Bacteroidetes      115
Firmicutes         632
Proteobacteria    1638
dtype: int64

In [14]:
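# An explicit index selects and orders the entries by those labels;
# labels missing from the dictionary (here 'Cyanobacteria') get NaN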
bacteria2 = pd.Series(bacteria_dict, index=['Cyanobacteria','Firmicutes','Proteobacteria','Actinobacteria'])
bacteria2

Cyanobacteria        NaN
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64
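
A short sketch of what the `index` argument does when the data is a dictionary. The two-entry dictionary `counts` is a made-up subset used only for illustration:

```python
import pandas as pd

# Hypothetical subset of the bacteria counts
counts = {"Firmicutes": 632, "Proteobacteria": 1638}

# Only the requested labels are kept, in the requested order
s = pd.Series(counts, index=["Cyanobacteria", "Firmicutes"])

print(s)
print(s.dtype)  # NaN forces the integer counts to be stored as floats
```

'Proteobacteria' is dropped because it is not in the requested index, and 'Cyanobacteria' is filled with NaN because it has no entry in the dictionary.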

In [20]:
import numpy as np
# Select a subset of labels; dtype=object stores the values as generic Python objects
bacteria24 = pd.Series(bacteria_dict, index=['Firmicutes','Proteobacteria','Actinobacteria'], dtype=object)
bacteria24

Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
dtype: object

In [13]:
# how to find null values 
bacteria2.isnull()


Cyanobacteria      True
Firmicutes        False
Proteobacteria    False
Actinobacteria    False
dtype: bool

In [15]:
bacteria2.isnull().sum()

1

In [21]:
# Adding two Series aligns them on index labels; a label missing from either one gives NaN
bacteria = pd.Series([632, 1638, 569, 115], 
    index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])

print(bacteria)
print(bacteria2 )
bacteria + bacteria2

Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
dtype: int64
Cyanobacteria        NaN
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64


Actinobacteria    1138.0
Bacteroidetes        NaN
Cyanobacteria        NaN
Firmicutes        1264.0
Proteobacteria    3276.0
dtype: float64
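
If the NaN results for unmatched labels are not wanted, `Series.add` with `fill_value` is one option. A minimal sketch, assuming two small made-up Series `a` and `b`:

```python
import pandas as pd

# Hypothetical Series that share only the 'Firmicutes' label
a = pd.Series([632, 1638], index=["Firmicutes", "Proteobacteria"])
b = pd.Series([569.0, 632.0], index=["Actinobacteria", "Firmicutes"])

plain = a + b                     # unmatched labels become NaN
filled = a.add(b, fill_value=0)   # the missing side is treated as 0

print(plain)
print(filled)
```

Whether treating a missing count as 0 is appropriate depends on the data; here it is only meant to show the alignment behavior.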

In [22]:
newSeries = bacteria + bacteria2
newSeries.isnull()

Actinobacteria    False
Bacteroidetes      True
Cyanobacteria      True
Firmicutes        False
Proteobacteria    False
dtype: bool

In [23]:
~newSeries.notnull()

Actinobacteria    False
Bacteroidetes      True
Cyanobacteria      True
Firmicutes        False
Proteobacteria    False
dtype: bool

In [None]:
# Boolean masks select the entries where the condition is True
newSeries[newSeries > 2]

In [17]:
newSeries[newSeries.isnull()]

Bacteroidetes   NaN
Cyanobacteria   NaN
dtype: float64

In [25]:
newSeries[[True,False, False,True,False]]

Actinobacteria    1138.0
Firmicutes        1264.0
dtype: float64

In [26]:
newSeries[newSeries > 3200]

Proteobacteria    3276.0
dtype: float64

In [18]:
newSeries[ ~newSeries.isnull()]

Actinobacteria    1138.0
Firmicutes        1264.0
Proteobacteria    3276.0
dtype: float64
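
The filtering cells above all pass a boolean mask in brackets. A small sketch, using a made-up Series `totals` with the same shape as `newSeries`, that compares the mask approach with `dropna()` and combines two conditions:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for newSeries
totals = pd.Series(
    [1138.0, np.nan, np.nan, 1264.0, 3276.0],
    index=["Actinobacteria", "Bacteroidetes", "Cyanobacteria",
           "Firmicutes", "Proteobacteria"],
)

present = totals[totals.notnull()]        # mask-based filtering
same = present.equals(totals.dropna())    # dropna() gives the same result

# Conditions are combined with & (and) or | (or), each wrapped in parentheses
large = totals[totals.notnull() & (totals > 1200)]

print(same)
print(large)
```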


### DataFrames

Real data is usually multivariate: for every index label there are several fields (columns), often of different data types, that we want to store, view, and manipulate together.

A DataFrame is a tabular data structure that holds multiple Series as columns, much like a spreadsheet.
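
To make the "multiple Series as columns" idea concrete, here is a minimal sketch that builds a DataFrame from two Series sharing an index. The names `value`, `phylum`, and the labels `"a"` and `"b"` are made up for illustration:

```python
import pandas as pd

# Two hypothetical Series with the same index labels
value = pd.Series([632, 1638], index=["a", "b"])
phylum = pd.Series(["Firmicutes", "Proteobacteria"], index=["a", "b"])

df = pd.DataFrame({"value": value, "phylum": phylum})

# Each column of the DataFrame is itself a Series on the shared index
column = df["value"]
print(type(column))
print(df.shape)
```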

In [13]:
import pandas as pd 

data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient':[1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
    'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes']})
data

   patient          phylum  value
0        1      Firmicutes    632
1        1  Proteobacteria   1638
2        1  Actinobacteria    569
3        1   Bacteroidetes    115
4        2      Firmicutes    433
5        2  Proteobacteria   1130
6        2  Actinobacteria    754
7        2   Bacteroidetes    555


When passing a dictionary, all of its values must have the same length. Here `value` has only 7 entries, so the constructor fails.

In [4]:
data = pd.DataFrame({'value':[632, 1638, 569, 115, 433,  754, 555],
                     'patient':[1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
    'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes']})
data

ValueError: arrays must all be same length
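
A quick way to find the offending column is to compare the lengths before building the frame. A minimal sketch, assuming a made-up dictionary `columns` with mismatched lists:

```python
import pandas as pd

# Hypothetical input with one list shorter than the other
columns = {"value": [632, 1638, 569], "patient": [1, 1, 1, 1]}

# Inspect the length of every column
lengths = {name: len(vals) for name, vals in columns.items()}
print(lengths)

try:
    pd.DataFrame(columns)
    built = True
except ValueError as err:
    built = False
    print("could not build DataFrame:", err)
```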

In [5]:
data['phylum']

0        Firmicutes
1    Proteobacteria
2    Actinobacteria
3     Bacteroidetes
4        Firmicutes
5    Proteobacteria
6    Actinobacteria
7     Bacteroidetes
Name: phylum, dtype: object

In [6]:
type(data['phylum'])

pandas.core.series.Series

In [7]:
data.phylum

0        Firmicutes
1    Proteobacteria
2    Actinobacteria
3     Bacteroidetes
4        Firmicutes
5    Proteobacteria
6    Actinobacteria
7     Bacteroidetes
Name: phylum, dtype: object

In [44]:
data = pd.DataFrame({'value':[632, 1638, 569, 115, 433,1136,  754, 555],
                     'head':[1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
    'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes']})
data

   head          phylum  value
0     1      Firmicutes    632
1     1  Proteobacteria   1638
2     1  Actinobacteria    569
3     1   Bacteroidetes    115
4     2      Firmicutes    433
5     2  Proteobacteria   1136
6     2  Actinobacteria    754
7     2   Bacteroidetes    555


In [45]:
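# 'head' is also a DataFrame method, so attribute access returns the method
# rather than the column; use data['head'] to get the column
data.head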

<bound method NDFrame.head of    head          phylum  value
0     1      Firmicutes    632
1     1  Proteobacteria   1638
2     1  Actinobacteria    569
3     1   Bacteroidetes    115
4     2      Firmicutes    433
5     2  Proteobacteria   1136
6     2  Actinobacteria    754
7     2   Bacteroidetes    555>
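
The output above is the `head` method, not the column. A small sketch, with a made-up two-row frame `df`, of why bracket access is the safer way to reach a column whose name collides with a DataFrame attribute:

```python
import pandas as pd

# Hypothetical frame with a column named like a DataFrame method
df = pd.DataFrame({"head": [1, 2], "value": [632, 433]})

print(callable(df.head))   # attribute access finds the method
column = df["head"]        # bracket access always finds the column
print(column.tolist())
```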

In [46]:
# to select several columns, pass a list of column names

data[['phylum','value']]

           phylum  value
0      Firmicutes    632
1  Proteobacteria   1638
2  Actinobacteria    569
3   Bacteroidetes    115
4      Firmicutes    433
5  Proteobacteria   1136
6  Actinobacteria    754
7   Bacteroidetes    555


In [47]:
type(data[['phylum','value']])

pandas.core.frame.DataFrame

In [48]:
type(data[['phylum']])

pandas.core.frame.DataFrame

In [49]:
# Selecting a column that does not exist raises a KeyError
# ('patient' was replaced by 'head' when data was rebuilt above)

data[['phylum','value','patient']]

KeyError: "['patient'] not in index"

In [50]:
data[['phylum']]

           phylum
0      Firmicutes
1  Proteobacteria
2  Actinobacteria
3   Bacteroidetes
4      Firmicutes
5  Proteobacteria
6  Actinobacteria
7   Bacteroidetes


In [51]:
data['phylum']

0        Firmicutes
1    Proteobacteria
2    Actinobacteria
3     Bacteroidetes
4        Firmicutes
5    Proteobacteria
6    Actinobacteria
7     Bacteroidetes
Name: phylum, dtype: object

In [52]:
print(type(data[["phylum"]]))
print(type(data["phylum"]))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [53]:
# The difference between the two: double brackets (a list of names) return a DataFrame,
# single brackets (one name) return a Series

In [54]:
data.index

RangeIndex(start=0, stop=8, step=1)

In [55]:
data.columns

Index(['head', 'phylum', 'value'], dtype='object')

In [56]:
# a column can also be accessed as an attribute (dot notation)
data.value

0     632
1    1638
2     569
3     115
4     433
5    1136
6     754
7     555
Name: value, dtype: int64

Indexing a Series with a label returns a single element, whereas indexing a DataFrame with a column name returns a whole column.

**How to get the rows?**

In [57]:
print(data)
data.iloc[2]  # iloc selects by integer position

   head          phylum  value
0     1      Firmicutes    632
1     1  Proteobacteria   1638
2     1  Actinobacteria    569
3     1   Bacteroidetes    115
4     2      Firmicutes    433
5     2  Proteobacteria   1136
6     2  Actinobacteria    754
7     2   Bacteroidetes    555


head                   1
phylum    Actinobacteria
value                569
Name: 2, dtype: object

In [58]:
data.iloc[2:6]

   head          phylum  value
2     1  Actinobacteria    569
3     1   Bacteroidetes    115
4     2      Firmicutes    433
5     2  Proteobacteria   1136


In [59]:
data

   head          phylum  value
0     1      Firmicutes    632
1     1  Proteobacteria   1638
2     1  Actinobacteria    569
3     1   Bacteroidetes    115
4     2      Firmicutes    433
5     2  Proteobacteria   1136
6     2  Actinobacteria    754
7     2   Bacteroidetes    555


In [60]:
data.loc[4] # loc is for label based location

head               2
phylum    Firmicutes
value            433
Name: 4, dtype: object

In [61]:
# loc slices include both endpoints; slice the columns with labels that exist
data.loc[2:6,"phylum":"value"]

           phylum  value
2  Actinobacteria    569
3   Bacteroidetes    115
4      Firmicutes    433
5  Proteobacteria   1136
6  Actinobacteria    754
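
The two slicing cells above differ in how they treat the end of the slice: `iloc[2:6]` returned four rows, while `loc[2:6]` returned five. A minimal sketch of that difference on a made-up frame `df` with the default integer index:

```python
import pandas as pd

# Hypothetical frame with index labels 0..4
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

by_position = df.iloc[1:3]   # positions 1 and 2; the end is excluded
by_label = df.loc[1:3]       # labels 1, 2 and 3; the end is included

print(len(by_position), len(by_label))
```

With the default index the labels happen to equal the positions, which is exactly when this difference is easiest to overlook.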


In [62]:
data

   head          phylum  value
0     1      Firmicutes    632
1     1  Proteobacteria   1638
2     1  Actinobacteria    569
3     1   Bacteroidetes    115
4     2      Firmicutes    433
5     2  Proteobacteria   1136
6     2  Actinobacteria    754
7     2   Bacteroidetes    555


In [63]:
dm_data = data["phylum"]



### Create or modify columns by Assignment

In [64]:
# Chained indexing (data["value"][3]) may write to a temporary copy,
# which is why pandas emits a SettingWithCopyWarning here
data["value"][3] = 1000
data.value

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


0     632
1    1638
2     569
3    1000
4     433
5    1136
6     754
7     555
Name: value, dtype: int64

In [65]:
# Preferred: a single .loc call with the row label and the column name
data.loc[3,"value"] = 1024
data.value

0     632
1    1638
2     569
3    1024
4     433
5    1136
6     754
7     555
Name: value, dtype: int64
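
Depending on the pandas version, the chained form may not update the original frame at all, so a single indexing call is the safer habit. A minimal sketch on a made-up frame `df`, using `.loc` and its scalar counterpart `.at`:

```python
import pandas as pd

# Hypothetical frame for illustration
df = pd.DataFrame({"value": [632, 1638, 569, 115]})

df.loc[3, "value"] = 1024   # one indexing operation on the frame itself
df.at[0, "value"] = 700     # .at sets a single cell by row and column label

print(df["value"].tolist())
```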

In [66]:
# add new column
import numpy as np 
data["new_column"] = np.arange(1,80,10)
data


   head          phylum  value  new_column
0     1      Firmicutes    632           1
1     1  Proteobacteria   1638          11
2     1  Actinobacteria    569          21
3     1   Bacteroidetes   1024          31
4     2      Firmicutes    433          41
5     2  Proteobacteria   1136          51
6     2  Actinobacteria    754          61
7     2   Bacteroidetes    555          71


In [67]:
# del can be used to delete the columns
del data["new_column"]
data

   head          phylum  value
0     1      Firmicutes    632
1     1  Proteobacteria   1638
2     1  Actinobacteria    569
3     1   Bacteroidetes   1024
4     2      Firmicutes    433
5     2  Proteobacteria   1136
6     2  Actinobacteria    754
7     2   Bacteroidetes    555


In [68]:
# get the underlying NumPy array from a DataFrame
# (to_numpy() is the recommended modern equivalent of .values)
data.values

array([[1, 'Firmicutes', 632],
       [1, 'Proteobacteria', 1638],
       [1, 'Actinobacteria', 569],
       [1, 'Bacteroidetes', 1024],
       [2, 'Firmicutes', 433],
       [2, 'Proteobacteria', 1136],
       [2, 'Actinobacteria', 754],
       [2, 'Bacteroidetes', 555]], dtype=object)
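
The `dtype=object` in the output above comes from mixing numbers and strings in one array. A small sketch with `to_numpy()` on two made-up frames, one all-numeric and one mixed:

```python
import pandas as pd

# Hypothetical frames for illustration
numeric = pd.DataFrame({"head": [1, 2], "value": [632, 433]})
mixed = pd.DataFrame({"head": [1], "phylum": ["Firmicutes"]})

numeric_array = numeric.to_numpy()   # all-integer columns keep an integer dtype
mixed_array = mixed.to_numpy()       # mixed columns fall back to object

print(numeric_array.dtype, mixed_array.dtype)
```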

In [69]:
data

   head          phylum  value
0     1      Firmicutes    632
1     1  Proteobacteria   1638
2     1  Actinobacteria    569
3     1   Bacteroidetes   1024
4     2      Firmicutes    433
5     2  Proteobacteria   1136
6     2  Actinobacteria    754
7     2   Bacteroidetes    555


In [70]:
data.index

RangeIndex(start=0, stop=8, step=1)

In [75]:
data.reindex(data.index[::-1])

   head          phylum  value
7     2   Bacteroidetes    555
6     2  Actinobacteria    754
5     2  Proteobacteria   1136
4     2      Firmicutes    433
3     1   Bacteroidetes   1024
2     1  Actinobacteria    569
1     1  Proteobacteria   1638
0     1      Firmicutes    632


In [77]:
# reindex looks rows up by label; labels that are not in the current index give all-NaN rows
data.reindex(["ashu","tosh","sing","h","is","a","genius","ok"], axis=0)

        head  phylum  value
ashu     NaN     NaN    NaN
tosh     NaN     NaN    NaN
sing     NaN     NaN    NaN
h        NaN     NaN    NaN
is       NaN     NaN    NaN
a        NaN     NaN    NaN
genius   NaN     NaN    NaN
ok       NaN     NaN    NaN


In [78]:
data

   head          phylum  value
0     1      Firmicutes    632
1     1  Proteobacteria   1638
2     1  Actinobacteria    569
3     1   Bacteroidetes   1024
4     2      Firmicutes    433
5     2  Proteobacteria   1136
6     2  Actinobacteria    754
7     2   Bacteroidetes    555


In [None]:
## give index less than length : TODO

In [79]:
# get the shape of the data 
data.shape

(8, 3)

In [82]:
# drop returns a new DataFrame; data itself is unchanged
# unless the result is assigned or inplace=True is used
data.drop([2,5])
data

   head          phylum  value
0     1      Firmicutes    632
1     1  Proteobacteria   1638
2     1  Actinobacteria    569
3     1   Bacteroidetes   1024
4     2      Firmicutes    433
5     2  Proteobacteria   1136
6     2  Actinobacteria    754
7     2   Bacteroidetes    555


In [83]:
data.drop([2,5], inplace=True)
data

   head          phylum  value
0     1      Firmicutes    632
1     1  Proteobacteria   1638
3     1   Bacteroidetes   1024
4     2      Firmicutes    433
6     2  Actinobacteria    754
7     2   Bacteroidetes    555


In [None]:
## TODO : give a range in drop to delete rows from 2 to 7
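
One possible way to approach the TODO above, sketched on a made-up ten-row frame `df` rather than on `data`: `drop` takes a list of labels, so a range can be passed either as labels or as a slice of the index.

```python
import pandas as pd

# Hypothetical frame with index labels 0..9
df = pd.DataFrame({"value": list(range(10))})

by_label = df.drop(list(range(2, 8)))   # drop the labels 2 through 7
by_position = df.drop(df.index[2:8])    # drop the rows at positions 2 through 7

print(by_label.index.tolist())
```

The two forms agree here only because the labels equal the positions. Dropping by label raises a `KeyError` if any requested label is missing, which is why the index slice is often the more robust choice after rows have already been removed.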

In [84]:
# NumPy-style slicing selects rows by position
data[:5]

   head          phylum  value
0     1      Firmicutes    632
1     1  Proteobacteria   1638
3     1   Bacteroidetes   1024
4     2      Firmicutes    433
6     2  Actinobacteria    754


In [85]:
data.iloc[:5,:]

   head          phylum  value
0     1      Firmicutes    632
1     1  Proteobacteria   1638
3     1   Bacteroidetes   1024
4     2      Firmicutes    433
6     2  Actinobacteria    754


In [86]:
#selecting rows and columns
data.iloc[2:3,1:3]

          phylum  value
3  Bacteroidetes   1024


In [88]:
data.iloc[2:3,1:2]

          phylum
3  Bacteroidetes


In [None]:
data.iloc[2:3,2:4]

In [None]:
data

In [None]:
# select rows and columns by label (the column slice must use existing labels)

data.loc[3:6,"phylum":"value"]

In [None]:
# - how to not print a column

In [89]:
data

   head          phylum  value
0     1      Firmicutes    632
1     1  Proteobacteria   1638
3     1   Bacteroidetes   1024
4     2      Firmicutes    433
6     2  Actinobacteria    754
7     2   Bacteroidetes    555


In [96]:
data.iloc[2,2] = 324

data

   head          phylum  value
0     1      Firmicutes    632
1     1  Proteobacteria   1638
3     1   Bacteroidetes    324
4     2      Firmicutes    433
6     2  Actinobacteria    754
7     2   Bacteroidetes    555


In [97]:
data.dtypes

head       int64
phylum    object
value      int64
dtype: object

In [98]:
data.sort_values(by = "value")

   head          phylum  value
3     1   Bacteroidetes    324
4     2      Firmicutes    433
7     2   Bacteroidetes    555
0     1      Firmicutes    632
6     2  Actinobacteria    754
1     1  Proteobacteria   1638


In [99]:
# Sorting values 

data.sort_values(ascending = False,by="value")

   head          phylum  value
1     1  Proteobacteria   1638
6     2  Actinobacteria    754
0     1      Firmicutes    632
7     2   Bacteroidetes    555
4     2      Firmicutes    433
3     1   Bacteroidetes    324


In [101]:
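# ascending must be a real bool: the string "False" is truthy, so it sorted in
# ascending order here, and newer pandas versions may reject it outright.
# Use ascending=False (no quotes) for a descending sort.
data.sort_values(by="head", ascending=True)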

   head          phylum  value
0     1      Firmicutes    632
1     1  Proteobacteria   1638
3     1   Bacteroidetes    324
4     2      Firmicutes    433
6     2  Actinobacteria    754
7     2   Bacteroidetes    555


In [103]:
data.sort_values(by="phylum")

   head          phylum  value
6     2  Actinobacteria    754
3     1   Bacteroidetes    324
7     2   Bacteroidetes    555
0     1      Firmicutes    632
4     2      Firmicutes    433
1     1  Proteobacteria   1638


In [102]:
# DataFrame.sort_values needs to know which column(s) to sort by
data.sort_values()

TypeError: sort_values() missing 1 required positional argument: 'by'

In [105]:
data.sort_values(by=["head","phylum"],ascending=[False, True])

   head          phylum  value
6     2  Actinobacteria    754
7     2   Bacteroidetes    555
4     2      Firmicutes    433
3     1   Bacteroidetes    324
0     1      Firmicutes    632
1     1  Proteobacteria   1638


In [106]:
data.sort_values(by=["head","phylum"])

   head          phylum  value
3     1   Bacteroidetes    324
0     1      Firmicutes    632
1     1  Proteobacteria   1638
6     2  Actinobacteria    754
7     2   Bacteroidetes    555
4     2      Firmicutes    433
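
When sorting by several columns, `ascending` can be a list with one entry per column. A minimal sketch on a made-up four-row frame `df`, which also shows that `sort_values` returns a new frame and leaves the original order alone:

```python
import pandas as pd

# Hypothetical frame for illustration
df = pd.DataFrame({"head": [1, 2, 1, 2], "value": [632, 433, 324, 754]})

# head descending, then value ascending within each head group
ordered = df.sort_values(by=["head", "value"], ascending=[False, True])

print(ordered)
print(df.index.tolist())   # the original frame keeps its order
```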


In [107]:
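# reset_index(drop=True) returns a new DataFrame with a fresh 0..N-1 index
# and discards the old labels; data itself is not modified
data.reset_index(drop=True)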


   head          phylum  value
0     1      Firmicutes    632
1     1  Proteobacteria   1638
2     1   Bacteroidetes    324
3     2      Firmicutes    433
4     2  Actinobacteria    754
5     2   Bacteroidetes    555
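
A short sketch of what the `drop` argument changes, using a made-up frame `df` whose index has a gap like the one left behind by dropping rows:

```python
import pandas as pd

# Hypothetical frame with a gap in its index labels
df = pd.DataFrame({"value": [632, 1638, 324]}, index=[0, 1, 3])

kept = df.reset_index()               # old labels move into a new 'index' column
dropped = df.reset_index(drop=True)   # old labels are discarded

print(kept.columns.tolist())
print(dropped.index.tolist())
```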


In [None]:
# axis=1 drops a column instead of a row; inplace=True modifies data directly
data.drop("value",axis=1,inplace=True)
data