## Pandas 

### Series

1. How to Create a Series?


In [1]:
import pandas as pd 

A Series is a one-dimensional labeled array, similar to a list or a single column in a table. Every item in a Series has an index label. By default the labels are the integers 0 to N - 1, where N is the length of the Series.

In [2]:
# Pass a list 
series = pd.Series([12,7.5,"Ashutosh","Delhi"])
series

0          12
1         7.5
2    Ashutosh
3       Delhi
dtype: object

In [3]:
# pass an index 
series = pd.Series([12,7.5,"Ashutosh","Delhi"], index=["10th marks","12th marks","name","place"])
series

10th marks          12
12th marks         7.5
name          Ashutosh
place            Delhi
dtype: object

In [4]:
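# pass a dictionary; its keys become the index labels
# (the output below comes from an older pandas that sorted the keys;
# pandas 0.23+ on Python 3.6+ keeps the insertion order)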
new_Series = pd.Series({"name":"Ashutosh", "age":29,"Place" : "Delhi"})
new_Series

Place       Delhi
age            29
name     Ashutosh
dtype: object

Accessing Elements in a Series

1. Using the index 

In [5]:
new_Series["name"]

'Ashutosh'

In [6]:
new_Series["age"]

29

In [7]:
new_Series.index

Index(['Place', 'age', 'name'], dtype='object')

In [8]:
# We can also use the position; .iloc makes the positional lookup explicit
new_Series.iloc[0]

'Delhi'

In [9]:
new_Series.iloc[2]

'Ashutosh'

In [10]:
# name the index and the Series itself
new_Series.index.name = "Info"
new_Series.name = "Details"
new_Series

Info
Place       Delhi
age            29
name     Ashutosh
Name: Details, dtype: object

In [11]:
# building a Series from a dictionary again
bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115}
pd.Series(bacteria_dict)

Actinobacteria     569
Bacteroidetes      115
Firmicutes         632
Proteobacteria    1638
dtype: int64

In [12]:
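# an explicit index selects and orders the entries by those labels;
# a label missing from the dict (Cyanobacteria) becomes NaN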
bacteria2 = pd.Series(bacteria_dict, index=['Cyanobacteria','Firmicutes','Proteobacteria','Actinobacteria'])
bacteria2

Cyanobacteria        NaN
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64

In [13]:
# how to find null values 
bacteria2.isnull()


Cyanobacteria      True
Firmicutes        False
Proteobacteria    False
Actinobacteria    False
dtype: bool

In [14]:
# Adding two Series aligns them on their labels and sums the matching entries;
# a label found in only one of them gives NaN
bacteria = pd.Series([632, 1638, 569, 115], 
    index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])

print(bacteria)
bacteria + bacteria2

Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
dtype: int64


Actinobacteria    1138.0
Bacteroidetes        NaN
Cyanobacteria        NaN
Firmicutes        1264.0
Proteobacteria    3276.0
dtype: float64
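
Labels present in only one operand produce NaN, as Bacteroidetes and Cyanobacteria do above. A minimal sketch, assuming we would rather treat a label missing from one side as zero: `Series.add` accepts a `fill_value` argument for that. The names `a`, `b`, `plain`, and `total` are illustrative and not part of the notebook.

```python
import pandas as pd

a = pd.Series([632, 1638, 569, 115],
              index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])
b = pd.Series([632, 1638, 569],
              index=['Firmicutes', 'Proteobacteria', 'Actinobacteria'])

# The + operator aligns on labels; a label missing from either side gives NaN.
plain = a + b

# add() with fill_value substitutes 0 for a label missing on one side only.
total = a.add(b, fill_value=0)

print(plain)
print(total)
```

Note that `fill_value` only helps when a value exists on one side; if the entry is missing on both sides the result is still NaN.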

In [15]:
newSeries = bacteria + bacteria2
newSeries.isnull()

Actinobacteria    False
Bacteroidetes      True
Cyanobacteria      True
Firmicutes        False
Proteobacteria    False
dtype: bool

In [16]:
~newSeries.notnull()

Actinobacteria    False
Bacteroidetes      True
Cyanobacteria      True
Firmicutes        False
Proteobacteria    False
dtype: bool

In [17]:
newSeries[newSeries.isnull()]

Bacteroidetes   NaN
Cyanobacteria   NaN
dtype: float64

In [18]:
newSeries[ ~newSeries.isnull()]

Actinobacteria    1138.0
Firmicutes        1264.0
Proteobacteria    3276.0
dtype: float64
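
The boolean-mask filtering above can also be written with dedicated methods. A short sketch, using a hand-built Series `s` (an illustrative stand-in for `newSeries`): `dropna()` removes the missing entries and `fillna()` replaces them.

```python
import numpy as np
import pandas as pd

s = pd.Series([1138.0, np.nan, np.nan, 1264.0, 3276.0],
              index=['Actinobacteria', 'Bacteroidetes', 'Cyanobacteria',
                     'Firmicutes', 'Proteobacteria'])

# Boolean-mask filtering, as in the cells above.
masked = s[s.notnull()]

# dropna() removes the missing entries in one call.
dropped = s.dropna()

# fillna() keeps every label and replaces NaN with a chosen value.
filled = s.fillna(0)

print(dropped)
print(filled)
```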

### DataFrames

Inevitably, we want to be able to store, view and manipulate data that is multivariate, where for every index there are multiple fields or columns of data, often of varying data type.

A DataFrame is a tabular data structure that holds multiple Series as columns sharing one index, like the columns in a spreadsheet.

In [19]:
data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient':[1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
    'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes']})
data

   patient          phylum  value
0        1      Firmicutes    632
1        1  Proteobacteria   1638
2        1  Actinobacteria    569
3        1   Bacteroidetes    115
4        2      Firmicutes    433
5        2  Proteobacteria   1130
6        2  Actinobacteria    754
7        2   Bacteroidetes    555


In [20]:
# A list of column names inside [] selects those columns, in the order given

data[['phylum','value','patient']]

           phylum  value  patient
0      Firmicutes    632        1
1  Proteobacteria   1638        1
2  Actinobacteria    569        1
3   Bacteroidetes    115        1
4      Firmicutes    433        2
5  Proteobacteria   1130        2
6  Actinobacteria    754        2
7   Bacteroidetes    555        2


In [21]:
data[['phylum']]

           phylum
0      Firmicutes
1  Proteobacteria
2  Actinobacteria
3   Bacteroidetes
4      Firmicutes
5  Proteobacteria
6  Actinobacteria
7   Bacteroidetes


In [22]:
data['phylum']

0        Firmicutes
1    Proteobacteria
2    Actinobacteria
3     Bacteroidetes
4        Firmicutes
5    Proteobacteria
6    Actinobacteria
7     Bacteroidetes
Name: phylum, dtype: object

In [23]:
print(type(data[["phylum"]]))
print(type(data["phylum"]))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [24]:
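# Difference between the two: data[["phylum"]] (a list of names) returns a
# DataFrame, while data["phylum"] (a single name) returns a Series.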

In [25]:
data.index

RangeIndex(start=0, stop=8, step=1)

In [26]:
data.columns

Index(['patient', 'phylum', 'value'], dtype='object')

In [27]:
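# a column can also be accessed as an attribute (dot notation), as long as its
# name is a valid Python identifier and does not clash with a DataFrame attribute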
data.value

0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64

With a Series, indexing by label returned a single entry (a row). With a DataFrame, the same syntax returns a column.

**How to get the rows?**

In [28]:
print(data)
data.iloc[2]  # iloc = integer (position-based) location

   patient          phylum  value
0        1      Firmicutes    632
1        1  Proteobacteria   1638
2        1  Actinobacteria    569
3        1   Bacteroidetes    115
4        2      Firmicutes    433
5        2  Proteobacteria   1130
6        2  Actinobacteria    754
7        2   Bacteroidetes    555


patient                 1
phylum     Actinobacteria
value                 569
Name: 2, dtype: object

In [29]:
data.loc[4]  # loc = label-based location

patient             2
phylum     Firmicutes
value             433
Name: 4, dtype: object
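
With the default `RangeIndex`, positions and labels coincide, so `iloc` and `loc` look interchangeable. A small sketch with a hypothetical DataFrame `df` whose index labels are 10, 20, 30 shows where they differ.

```python
import pandas as pd

df = pd.DataFrame({'phylum': ['Firmicutes', 'Proteobacteria', 'Actinobacteria'],
                   'value': [632, 1638, 569]},
                  index=[10, 20, 30])

# iloc counts positions from 0, regardless of the index labels.
by_position = df.iloc[1]

# loc looks up the index label itself.
by_label = df.loc[20]

print(by_position)
print(by_label)

# df.loc[1] would raise KeyError here, because no row is labeled 1.
```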

In [30]:
data

   patient          phylum  value
0        1      Firmicutes    632
1        1  Proteobacteria   1638
2        1  Actinobacteria    569
3        1   Bacteroidetes    115
4        2      Firmicutes    433
5        2  Proteobacteria   1130
6        2  Actinobacteria    754
7        2   Bacteroidetes    555


### Create or modify columns by Assignment

In [31]:
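# chained indexing (two [] in a row): pandas cannot tell whether this writes
# to data or to a temporary copy, hence the warning below
data["value"][3] = 1000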
data.value

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


0     632
1    1638
2     569
3    1000
4     433
5    1130
6     754
7     555
Name: value, dtype: int64

In [32]:
data.loc[3,"value"] = 1024
data.value

0     632
1    1638
2     569
3    1024
4     433
5    1130
6     754
7     555
Name: value, dtype: int64
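
The warning above comes from chained indexing, while a single `.loc[row, column]` call assigns without it. The same pattern extends to conditional updates. A minimal sketch, assuming a small illustrative DataFrame `df` and an arbitrary threshold of 500:

```python
import pandas as pd

df = pd.DataFrame({'patient': [1, 1, 2, 2],
                   'value': [632, 115, 433, 555]})

# One .loc call selects rows and column together, so pandas writes
# into df itself rather than into a temporary object.
df.loc[df['value'] < 500, 'value'] = 500

print(df)
```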

In [33]:
# add new column
import numpy as np 
data["new_column"] = np.arange(1,80,10)
data

   patient          phylum  value  new_column
0        1      Firmicutes    632           1
1        1  Proteobacteria   1638          11
2        1  Actinobacteria    569          21
3        1   Bacteroidetes   1024          31
4        2      Firmicutes    433          41
5        2  Proteobacteria   1130          51
6        2  Actinobacteria    754          61
7        2   Bacteroidetes    555          71


In [34]:
# del can be used to delete the columns
del data["new_column"]
data

   patient          phylum  value
0        1      Firmicutes    632
1        1  Proteobacteria   1638
2        1  Actinobacteria    569
3        1   Bacteroidetes   1024
4        2      Firmicutes    433
5        2  Proteobacteria   1130
6        2  Actinobacteria    754
7        2   Bacteroidetes    555


In [35]:
# get the underlying NumPy array from a DataFrame
# (newer pandas also offers data.to_numpy())
data.values

array([[1, 'Firmicutes', 632],
       [1, 'Proteobacteria', 1638],
       [1, 'Actinobacteria', 569],
       [1, 'Bacteroidetes', 1024],
       [2, 'Firmicutes', 433],
       [2, 'Proteobacteria', 1130],
       [2, 'Actinobacteria', 754],
       [2, 'Bacteroidetes', 555]], dtype=object)

In [36]:
data

   patient          phylum  value
0        1      Firmicutes    632
1        1  Proteobacteria   1638
2        1  Actinobacteria    569
3        1   Bacteroidetes   1024
4        2      Firmicutes    433
5        2  Proteobacteria   1130
6        2  Actinobacteria    754
7        2   Bacteroidetes    555


In [37]:
data.index

RangeIndex(start=0, stop=8, step=1)

In [38]:
data.reindex(data.index[::-1])

   patient          phylum  value
7        2   Bacteroidetes    555
6        2  Actinobacteria    754
5        2  Proteobacteria   1130
4        2      Firmicutes    433
3        1   Bacteroidetes   1024
2        1  Actinobacteria    569
1        1  Proteobacteria   1638
0        1      Firmicutes    632


In [39]:
# TODO: reindex with fewer labels than the DataFrame has rows
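
One way to explore this TODO, as a sketch on a small illustrative DataFrame `df` (not the notebook's `data`): `reindex` keeps only the labels it is given, and a label that does not exist produces a row of NaN.

```python
import pandas as pd

df = pd.DataFrame({'patient': [1, 1, 2, 2],
                   'value': [632, 1638, 433, 1130]})

# Fewer labels than rows: only the requested rows are kept, in the order given.
subset = df.reindex([3, 0])

# A label that does not exist (here 9) yields a row of NaN,
# which also upcasts the integer columns to float.
with_missing = df.reindex([0, 9])

print(subset)
print(with_missing)
```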

In [40]:
# get the shape of the data 
data.shape

(8, 3)

In [41]:
# delete rows 
data = data.drop([2,5])
data

   patient          phylum  value
0        1      Firmicutes    632
1        1  Proteobacteria   1638
3        1   Bacteroidetes   1024
4        2      Firmicutes    433
6        2  Actinobacteria    754
7        2   Bacteroidetes    555


In [42]:
# TODO: pass a range to drop() to delete rows 2 to 7
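
A possible approach to this TODO, sketched on an illustrative DataFrame `df` with the default index: build the labels with `range` and pass them to `drop`. The `errors='ignore'` variant is an assumption about what is wanted when some of those labels have already been deleted, as is the case for `data` after the previous cell.

```python
import pandas as pd

df = pd.DataFrame({'value': [632, 1638, 569, 115, 433, 1130, 754, 555]})

# range(2, 8) covers labels 2..7; the stop value is exclusive.
trimmed = df.drop(list(range(2, 8)))

# If some labels in the range were already removed, drop() raises KeyError
# unless errors='ignore' is passed.
partial = df.drop([2, 5])
trimmed_again = partial.drop(list(range(2, 8)), errors='ignore')

print(trimmed)
print(trimmed_again)
```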

In [43]:
# NumPy-style slicing selects rows by position
data[:5]

   patient          phylum  value
0        1      Firmicutes    632
1        1  Proteobacteria   1638
3        1   Bacteroidetes   1024
4        2      Firmicutes    433
6        2  Actinobacteria    754


In [44]:
data.iloc[:5,:]

   patient          phylum  value
0        1      Firmicutes    632
1        1  Proteobacteria   1638
3        1   Bacteroidetes   1024
4        2      Firmicutes    433
6        2  Actinobacteria    754


In [45]:
# select rows and columns by position; the stop of each slice is exclusive
data.iloc[2:3,1:3]

          phylum  value
3  Bacteroidetes   1024


In [46]:
data.iloc[2:3,1:2]

          phylum
3  Bacteroidetes


In [47]:
# a slice that runs past the last column is truncated instead of raising
data.iloc[2:3,2:4]

   value
3   1024


In [48]:
data

   patient          phylum  value
0        1      Firmicutes    632
1        1  Proteobacteria   1638
3        1   Bacteroidetes   1024
4        2      Firmicutes    433
6        2  Actinobacteria    754
7        2   Bacteroidetes    555


In [49]:
# select by labels; unlike iloc, loc slices include both endpoints

data.loc[3:6,"patient" : "value"]

   patient          phylum  value
3        1   Bacteroidetes   1024
4        2      Firmicutes    433
6        2  Actinobacteria    754


In [50]:
# TODO: how to leave a column out when displaying the DataFrame
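
Two ways this could be done, sketched on a small illustrative DataFrame `df`: both return a new DataFrame for display and leave the original untouched.

```python
import pandas as pd

df = pd.DataFrame({'patient': [1, 2],
                   'phylum': ['Firmicutes', 'Bacteroidetes'],
                   'value': [632, 555]})

# drop() returns a new DataFrame without the column; df itself is unchanged.
without_value = df.drop(columns=['value'])

# Equivalent selection: keep every column whose name is not 'value'.
also_without = df.loc[:, df.columns != 'value']

print(without_value)
```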

In [51]:
data

   patient          phylum  value
0        1      Firmicutes    632
1        1  Proteobacteria   1638
3        1   Bacteroidetes   1024
4        2      Firmicutes    433
6        2  Actinobacteria    754
7        2   Bacteroidetes    555


In [52]:
# Sorting values

data.sort_values(by="value", ascending=False)

   patient          phylum  value
1        1  Proteobacteria   1638
3        1   Bacteroidetes   1024
6        2  Actinobacteria    754
0        1      Firmicutes    632
7        2   Bacteroidetes    555
4        2      Firmicutes    433


In [53]:
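# ascending must be a real bool: the string "False" is truthy, so the
# original call sorted in ascending order, as the output shows
data.sort_values(by="patient", ascending=True)

   patient          phylum  value
0        1      Firmicutes    632
1        1  Proteobacteria   1638
3        1   Bacteroidetes   1024
4        2      Firmicutes    433
6        2  Actinobacteria    754
7        2   Bacteroidetes    555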


In [54]:
data.sort_values(by="phylum")

   patient          phylum  value
6        2  Actinobacteria    754
3        1   Bacteroidetes   1024
7        2   Bacteroidetes    555
0        1      Firmicutes    632
4        2      Firmicutes    433
1        1  Proteobacteria   1638


In [55]:
# a DataFrame needs `by` (the column or columns to sort on), so this fails
data.sort_values()

TypeError: sort_values() missing 1 required positional argument: 'by'

In [56]:
data.sort_values(by=["patient","phylum"],ascending=[False, True])

   patient          phylum  value
6        2  Actinobacteria    754
7        2   Bacteroidetes    555
4        2      Firmicutes    433
3        1   Bacteroidetes   1024
0        1      Firmicutes    632
1        1  Proteobacteria   1638


In [57]:
# reset_index(drop=True) returns a copy renumbered from 0; data is unchanged
data.reset_index(drop=True)

   patient          phylum  value
0        1      Firmicutes    632
1        1  Proteobacteria   1638
2        1   Bacteroidetes   1024
3        2      Firmicutes    433
4        2  Actinobacteria    754
5        2   Bacteroidetes    555
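
A short sketch of what `reset_index` does with the old labels, using an illustrative DataFrame `df` that has a gap in its index. The result must be assigned to a name if it is to be kept.

```python
import pandas as pd

df = pd.DataFrame({'phylum': ['Firmicutes', 'Proteobacteria', 'Bacteroidetes'],
                   'value': [632, 1638, 115]},
                  index=[0, 1, 3])

# reset_index() returns a new DataFrame; df keeps its old labels.
renumbered = df.reset_index(drop=True)

# Without drop=True the old labels are kept as a column named 'index'.
kept = df.reset_index()

print(renumbered)
print(kept)
```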


In [60]:
# axis=1 drops a column; inplace=True modifies data itself and returns None
data.drop("value",axis=1,inplace=True)
data

   patient          phylum
0        1      Firmicutes
1        1  Proteobacteria
3        1   Bacteroidetes
4        2      Firmicutes
6        2  Actinobacteria
7        2   Bacteroidetes


In [None]:
# after re-indexing, the row labels are 0, 1, 2, ...
# use drop to delete rows or columns
