#Pandas
The [Pandas](https://pandas.pydata.org) package is essential for data analysis. Building on top of NumPy, Pandas provides:
*   labeled arrays
*   missing data handling
*   convenient methods (groupby, rolling, resample)
*   more data types (Categorical, Datetime)
*   built-in functionality for reading/writing many kinds of files
*   methods for basic statistical analysis and data visualization

As with NumPy, we need to import the `pandas` package before we can use it.



In [None]:
import pandas as pd
import numpy as np

from google.colab import drive
drive.mount('/content/drive') # mount our Google Drive so that we can load the datasets stored there

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


##Pandas data structures

###Series
A Series is a single vector of data with an index that labels each element in the vector.

In [None]:
counts = pd.Series([16, 12, 12, 50000])
print(counts)

0       16
1       12
2       12
3    50000
dtype: int64


The first column corresponds to the `index`. If an index is not specified, a default sequence of integers starting at 0 is assigned.

To obtain only the values without the index, we use the `values` attribute.

In [None]:
print(counts.values)

[   16    12    12 50000]


We can always assign meaningful labels to the `index`.

In [None]:
characteristics = pd.Series([16, 12, 12, 50000],
                            index=['years of education', 'mother years of education', 'father years of education', 'income'])
print(characteristics)

years of education              16
mother years of education       12
father years of education       12
income                       50000
dtype: int64


Index labels can be used to retrieve a specific element of the Series.

In [None]:
c_years_of_education = characteristics['years of education']
print(c_years_of_education)

16
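
As a small sketch of the same idea, an element can be reached by label with `.loc` or by position with `.iloc`. The variable names below are illustrative only.

```python
import pandas as pd

s = pd.Series([16, 12, 12, 50000],
              index=['years of education', 'mother years of education',
                     'father years of education', 'income'])

by_label = s.loc['income']    # label-based lookup
by_position = s.iloc[3]       # position-based lookup (fourth element)
several = s.loc[['income', 'years of education']]  # a list of labels returns a Series

print(by_label, by_position)
print(several)
```

Using `.loc` and `.iloc` explicitly makes it clear whether a label or a position is meant, which matters when the index itself consists of integers.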


We can also retrieve elements with boolean indexing.

In [None]:
bool_index=[True, False, False, True]
b_characteristics = characteristics[bool_index]
print(b_characteristics)

years of education       16
income                50000
dtype: int64


We can also give meaningful names to the whole series and to the index.

In [None]:
marks = pd.Series([75, 83, 96, 49, 54], index=['George', 'John', 'Elena', 'Jason', 'Maria'])
print(marks)

George    75
John      83
Elena     96
Jason     49
Maria     54
dtype: int64


In [None]:
marks.name = 'Final Mark'
marks.index.name = 'Student'
print(marks)

Student
George    75
John      83
Elena     96
Jason     49
Maria     54
Name: Final Mark, dtype: int64


In [None]:
print(marks > 80)

Student
George    False
John       True
Elena      True
Jason     False
Maria     False
Name: Final Mark, dtype: bool


In [None]:
very_good_students = marks[marks > 80] # retrieve elements using boolean indexing
print(very_good_students)

Student
John     83
Elena    96
Name: Final Mark, dtype: int64
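
A boolean Series can also be summarised or inverted. The sketch below rebuilds the `marks` Series so that it runs on its own; the names `above_80`, `n_above`, `share_above` and `others` are illustrative.

```python
import pandas as pd

marks = pd.Series([75, 83, 96, 49, 54],
                  index=['George', 'John', 'Elena', 'Jason', 'Maria'])

above_80 = marks > 80                  # boolean mask
n_above = int(above_80.sum())          # True counts as 1, so the sum is a count
share_above = float(above_80.mean())   # the mean is the fraction of True values
others = marks[~above_80]              # ~ inverts the mask

print(n_above, share_above)
print(others)
```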


A Series behaves like an ordered key-value structure, so we can create one directly from a dictionary.

In [None]:
student_dict = {'George':75,
                'John':83,
                'Elena':96,
                'Jason':49,
                'Maria':54}
marks1 = pd.Series(student_dict)
print(marks1)

George    75
John      83
Elena     96
Jason     49
Maria     54
dtype: int64


We can also pass a dictionary to `Series` together with the `index` keyword. If a label in the index has no matching key in the dictionary, Pandas assigns it the value `NaN`. Because `NaN` is a floating-point value, the dtype of the Series becomes `float64`.

In [None]:
marks2 = pd.Series(student_dict, index=['George', 'John', 'Elena', 'Jason', 'Maria', 'Chris'])
print(marks2)

George    75.0
John      83.0
Elena     96.0
Jason     49.0
Maria     54.0
Chris      NaN
dtype: float64


We can check for null values by using the method `.isnull()`.

In [None]:
print(marks2.isnull())

George    False
John      False
Elena     False
Jason     False
Maria     False
Chris      True
dtype: bool
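
Once missing entries are detected, they can be counted or removed. This is a minimal sketch that rebuilds `marks2` so it is self-contained; `n_missing`, `present` and `dropped` are illustrative names.

```python
import pandas as pd

student_dict = {'George': 75, 'John': 83, 'Elena': 96, 'Jason': 49, 'Maria': 54}
marks2 = pd.Series(student_dict,
                   index=['George', 'John', 'Elena', 'Jason', 'Maria', 'Chris'])

n_missing = int(marks2.isnull().sum())   # number of NaN entries
present = marks2[marks2.notnull()]       # keep only the non-missing entries
dropped = marks2.dropna()                # same result using dropna()

print(n_missing)
print(dropped)
```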


###DataFrame
Unlike a `Series`, a `DataFrame` is useful for storing, viewing and manipulating *multivariate* data.

A `DataFrame` is a tabular data structure, encapsulating multiple series like columns in a spreadsheet. Data are stored internally as a 2-dimensional object.


In [None]:
student_dict = {'First Name':['George', 'John', 'Elena', 'Jason', 'Maria'],
                'Family Name': ['Smith', 'Romel', 'Burghes', 'Proto', 'Marianou'],
                'Mark': [75, 83, 96, 49, 54]}

data = pd.DataFrame(student_dict)
print(data)

  First Name Family Name  Mark
0     George       Smith    75
1       John       Romel    83
2      Elena     Burghes    96
3      Jason       Proto    49
4      Maria    Marianou    54
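
A quick way to check what a `DataFrame` contains is to look at its shape, column labels and dtypes. The sketch below uses a shortened version of the table above.

```python
import pandas as pd

data = pd.DataFrame({'First Name': ['George', 'John', 'Elena'],
                     'Family Name': ['Smith', 'Romel', 'Burghes'],
                     'Mark': [75, 83, 96]})

print(data.shape)          # (number of rows, number of columns)
print(list(data.columns))  # column labels, in order
print(data['Mark'].dtype)  # each column is a Series with its own dtype
```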


By explicitly listing the column names in the desired order, we can change the order in which the columns are printed.

In [None]:
print(data[['Mark', 'First Name', 'Family Name']])

   Mark First Name Family Name
0    75     George       Smith
1    83       John       Romel
2    96      Elena     Burghes
3    49      Jason       Proto
4    54      Maria    Marianou


If we wish to access columns, we can do so either by dict-like indexing or by attribute.

In [None]:
all_marks = data['Mark']
print(all_marks) # dict-like indexing
print(type(all_marks))

0    75
1    83
2    96
3    49
4    54
Name: Mark, dtype: int64
<class 'pandas.core.series.Series'>


In [None]:
print(data.Mark) # indexing by attribute: the attribute name is the same as the column name
print(type(data.Mark))

0    75
1    83
2    96
3    49
4    54
Name: Mark, dtype: int64
<class 'pandas.core.series.Series'>
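
Attribute access is convenient but has limits, which the following sketch illustrates. The column named `count` is a made-up example chosen because it clashes with an existing `DataFrame` method.

```python
import pandas as pd

data = pd.DataFrame({'First Name': ['George', 'John'],
                     'Mark': [75, 83],
                     'count': [1, 2]})

first_names = data['First Name']  # brackets work even with a space in the name
counts_col = data['count']        # the column
counts_attr = data.count          # NOT the column: this is the DataFrame.count method

print(callable(counts_attr))
print(list(counts_col))
```

For this reason dict-like indexing is the safer default; attribute access only works when the column name is a valid Python identifier that does not collide with an existing attribute or method.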


###Indexing a `DataFrame`

If we want access to a row in a `DataFrame`, we index its `loc` or `iloc` attribute.

In [None]:
np.random.seed(0)
data = pd.DataFrame({'v1': np.random.rand(4),
                     'v2': np.random.rand(4)},
                    index=['w','x','y','z'])
print(data.head())

         v1        v2
w  0.548814  0.423655
x  0.715189  0.645894
y  0.602763  0.437587
z  0.544883  0.891773


If we want to access something by its label, we use the attribute `loc`.

In [None]:
y_values = data.loc[['y']]
print(y_values)

         v1        v2
y  0.602763  0.437587


In [None]:
xy_values = data.loc[['y', 'x']]
print(xy_values)

         v1        v2
y  0.602763  0.437587
x  0.715189  0.645894
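
The brackets inside `loc` determine the type of the result. The sketch below uses small made-up numbers rather than the random values above.

```python
import pandas as pd

data = pd.DataFrame({'v1': [0.1, 0.2, 0.3], 'v2': [1.0, 2.0, 3.0]},
                    index=['w', 'x', 'y'])

row_as_series = data.loc['y']      # single label: a Series indexed by column names
row_as_frame = data.loc[['y']]     # list of labels: a DataFrame with one row
single_cell = data.loc['y', 'v1']  # row label and column label: a scalar

print(type(row_as_series).__name__, type(row_as_frame).__name__, single_cell)
```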


If we also want to select specific columns, we pass their labels as a second argument to `loc`.

In [None]:
xyz_v1_values = data.loc[['x', 'y', 'z'], ['v1']]
print(xyz_v1_values)

         v1
x  0.715189
y  0.602763
z  0.544883


In [None]:
xyz_v1_values = data.loc['x':'z', ['v1']] # 'x':'z' is a label slice, so no extra brackets are needed; unlike a positional slice it includes the end label 'z'
print(xyz_v1_values)

         v1
x  0.715189
y  0.602763
z  0.544883


To access something by its row/column position instead, we use `.iloc`.

In [None]:
print(data)
print()
print(data.iloc[[2]]) # Python starts counting from 0, so this returns the 3rd row

         v1        v2
w  0.548814  0.423655
x  0.715189  0.645894
y  0.602763  0.437587
z  0.544883  0.891773

         v1        v2
y  0.602763  0.437587


In [None]:
print(data.iloc[0:2]) # returns the 1st and 2nd rows

         v1        v2
w  0.548814  0.423655
x  0.715189  0.645894


In [None]:
print(data.iloc[0:2, [0]]) # returns the 1st and 2nd rows and the first column

         v1
w  0.548814
x  0.715189
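
One difference between the two indexers is easy to miss: a label slice with `loc` includes its end point, whereas a positional slice with `iloc` excludes it. A minimal sketch with made-up integer data:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'v1': np.arange(4), 'v2': np.arange(4) * 10},
                    index=['w', 'x', 'y', 'z'])

by_label = data.loc['x':'z']    # rows x, y and z (end label included)
by_position = data.iloc[1:3]    # rows at positions 1 and 2 only (end excluded)

print(by_label)
print(by_position)
```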


### Boolean indexing

`.loc` also supports boolean indices, which is very useful for data wrangling and exploration.


In [None]:
v1_geq = data.v1 >= 0.6
print()
print(v1_geq)


w    False
x     True
y     True
z    False
Name: v1, dtype: bool


We can use boolean comparisons to retrieve a subset of the dataset.

In [None]:
data_geq = data.loc[data.v1 >= 0.6]
print(data_geq)

         v1        v2
x  0.715189  0.645894
y  0.602763  0.437587


And even combine boolean comparisons.

In [None]:
data_geq_leq = data.loc[(data.v1 >= 0.6) & (data.v2 <= 0.5)]
print(data_geq_leq)

         v1        v2
y  0.602763  0.437587


In [None]:
data_geq_leq_v2 = data.loc[(data.v1 > 0.6) & (data.v2 < 0.5), ['v2']]
print(data_geq_leq_v2)

         v2
y  0.437587
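
Besides `&`, conditions can be combined with `|` (or) and inverted with `~` (not), and `DataFrame.query` accepts the same kind of filter as a string. The sketch below uses rounded, hand-typed values, so its numbers are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'v1': [0.55, 0.72, 0.60, 0.54],
                     'v2': [0.42, 0.65, 0.44, 0.89]},
                    index=['w', 'x', 'y', 'z'])

either = data.loc[(data.v1 >= 0.7) | (data.v2 >= 0.8)]      # logical OR
neither = data.loc[~((data.v1 >= 0.7) | (data.v2 >= 0.8))]  # logical NOT of the same mask
via_query = data.query('v1 >= 0.6 and v2 <= 0.5')           # string form of an AND filter

print(either)
print(neither)
print(via_query)
```

Each comparison needs its own parentheses because `&`, `|` and `~` bind more tightly than the comparison operators.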


### Changing data frames

In [None]:
data = pd.DataFrame.from_dict({0: {'patient': 1, 'phylum': 'Firmicutes', 'value': 632},
                               1: {'patient': 1, 'phylum': 'Proteobacteria', 'value': 1638},
                               2: {'patient': 1, 'phylum': 'Actinobacteria', 'value': 569},
                               3: {'patient': 1, 'phylum': 'Bacteroidetes', 'value': 115},
                               4: {'patient': 2, 'phylum': 'Firmicutes', 'value': 433},
                               5: {'patient': 2, 'phylum': 'Proteobacteria', 'value': 1130},
                               6: {'patient': 2, 'phylum': 'Actinobacteria', 'value': 754},
                               7: {'patient': 2, 'phylum': 'Bacteroidetes', 'value': 555}}, orient='index')
print(data)

   patient          phylum  value
0        1      Firmicutes    632
1        1  Proteobacteria   1638
2        1  Actinobacteria    569
3        1   Bacteroidetes    115
4        2      Firmicutes    433
5        2  Proteobacteria   1130
6        2  Actinobacteria    754
7        2   Bacteroidetes    555


In [None]:
value = data.value
print(value)

0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64


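In the pandas version used here, `value` refers to the same underlying data as the `value` column of `data`, so changing one of its elements also changes `data`. Pandas flags this with a `SettingWithCopyWarning`, and the behavior is version-dependent: with Copy-on-Write enabled (the default from pandas 3.0 onwards), `data` would be left unchanged.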

In [None]:
value[5] = 0
print(value)
print()
print(data)

0     632
1    1638
2     569
3     115
4     433
5       0
6     754
7     555
Name: value, dtype: int64

   patient          phylum  value
0        1      Firmicutes    632
1        1  Proteobacteria   1638
2        1  Actinobacteria    569
3        1   Bacteroidetes    115
4        2      Firmicutes    433
5        2  Proteobacteria      0
6        2  Actinobacteria    754
7        2   Bacteroidetes    555


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  value[5] = 0


However, if we want to leave `data` intact and work on the `value` series we should use the method `.copy()`.

In [None]:
value = data.value.copy()
value[2] = 1000
print(value)
print()
print(data)

0     632
1    1638
2    1000
3     115
4     433
5       0
6     754
7     555
Name: value, dtype: int64

   patient          phylum  value
0        1      Firmicutes    632
1        1  Proteobacteria   1638
2        1  Actinobacteria    569
3        1   Bacteroidetes    115
4        2      Firmicutes    433
5        2  Proteobacteria      0
6        2  Actinobacteria    754
7        2   Bacteroidetes    555
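
The effect of `.copy()` can be checked on a small example. This sketch uses two rows of the table above and should behave the same way regardless of pandas version.

```python
import pandas as pd

data = pd.DataFrame({'phylum': ['Firmicutes', 'Proteobacteria'],
                     'value': [632, 1638]})

value = data['value'].copy()   # independent copy of the column
value.loc[1] = 0               # change the copy only

print(value.tolist())
print(data['value'].tolist())  # the original column is untouched
```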


We can also modify existing values by assignment

In [None]:
data.loc[3, 'value'] = 14 # modify the 4th element of column 'value'
print(data)

   patient          phylum  value
0        1      Firmicutes    632
1        1  Proteobacteria   1638
2        1  Actinobacteria    569
3        1   Bacteroidetes     14
4        2      Firmicutes    433
5        2  Proteobacteria      0
6        2  Actinobacteria    754
7        2   Bacteroidetes    555


We can do the same using the `.iloc` attribute

In [None]:
data.iloc[2, 0] = 3 # modify the third element of the first column
print(data)

   patient          phylum  value
0        1      Firmicutes    632
1        1  Proteobacteria   1638
2        3  Actinobacteria    569
3        1   Bacteroidetes     14
4        2      Firmicutes    433
5        2  Proteobacteria      0
6        2  Actinobacteria    754
7        2   Bacteroidetes    555


Or modify several elements at the same time.

In [None]:
data['patient'] = data['patient'] + 10
print(data)

   patient          phylum  value
0       11      Firmicutes    632
1       11  Proteobacteria   1638
2       13  Actinobacteria    569
3       11   Bacteroidetes     14
4       12      Firmicutes    433
5       12  Proteobacteria      0
6       12  Actinobacteria    754
7       12   Bacteroidetes    555


And we can also create new variables by assignment

In [None]:
data['year'] = 2013
print(data)

   patient          phylum  value  year
0       11      Firmicutes    632  2013
1       11  Proteobacteria   1638  2013
2       13  Actinobacteria    569  2013
3       11   Bacteroidetes     14  2013
4       12      Firmicutes    433  2013
5       12  Proteobacteria      0  2013
6       12  Actinobacteria    754  2013
7       12   Bacteroidetes    555  2013


There are three ways to add a column whose values differ from row to row: a 1D array, a list or a `Series`.

In [None]:
#data['age'] = np.repeat([35,60],4) # a 1D NumPy array
#data['age'] = [30, 30, 30, 30, 65, 65, 65, 65] # a list
data['age'] = pd.Series([20, 20, 20, 20, 45, 45, 45, 45]) # a Series is aligned on the index; rows whose label is missing from it get NaN
print(data)

   patient          phylum  value  year  age
0       11      Firmicutes    632  2013   20
1       11  Proteobacteria   1638  2013   20
2       13  Actinobacteria    569  2013   20
3       11   Bacteroidetes     14  2013   20
4       12      Firmicutes    433  2013   45
5       12  Proteobacteria      0  2013   45
6       12  Actinobacteria    754  2013   45
7       12   Bacteroidetes    555  2013   45
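
The difference between assigning a list and assigning a `Series` shows up when the new values do not cover every row. A minimal sketch, with made-up column names `age_list` and `age_series`:

```python
import pandas as pd

data = pd.DataFrame({'patient': [1, 1, 2, 2]})

# A list (or 1D array) must have exactly one entry per row.
data['age_list'] = [20, 20, 45, 45]

# A Series is aligned on the index; rows without a matching label get NaN.
data['age_series'] = pd.Series([20, 45], index=[0, 2])

print(data)
```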


The method `.eval` can be used to create new columns based on the existing ones.

In [None]:
patient_plus_value = data.eval('patient + value')
print(patient_plus_value)
print()
print(data)

0     643
1    1649
2     582
3      25
4     445
5      12
6     766
7     567
dtype: int64

   patient          phylum  value  year  age
0       11      Firmicutes    632  2013   20
1       11  Proteobacteria   1638  2013   20
2       13  Actinobacteria    569  2013   20
3       11   Bacteroidetes     14  2013   20
4       12      Firmicutes    433  2013   45
5       12  Proteobacteria      0  2013   45
6       12  Actinobacteria    754  2013   45
7       12   Bacteroidetes    555  2013   45


In [None]:
data.eval('pv = patient + value', inplace=True) # we need to use the inplace argument for writing changes in the data
print(data)

   patient          phylum  value  year  age    pv
0       11      Firmicutes    632  2013   20   643
1       11  Proteobacteria   1638  2013   20  1649
2       13  Actinobacteria    569  2013   20   582
3       11   Bacteroidetes     14  2013   20    25
4       12      Firmicutes    433  2013   45   445
5       12  Proteobacteria      0  2013   45    12
6       12  Actinobacteria    754  2013   45   766
7       12   Bacteroidetes    555  2013   45   567
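
Without `inplace=True`, an assignment inside `.eval` returns a new `DataFrame` and leaves the original untouched. `DataFrame.assign` is an alternative that works the same way. A minimal sketch on two rows of the table above:

```python
import pandas as pd

data = pd.DataFrame({'patient': [11, 12], 'value': [632, 433]})

via_eval = data.eval('pv = patient + value')                  # returns a new DataFrame
via_assign = data.assign(pv=data['patient'] + data['value'])  # also returns a new DataFrame

print(via_eval)
print(list(data.columns))  # the original has no 'pv' column
```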


Usually we want to delete columns that are irrelevant to our analysis. We can do that using the keyword `del`.

In [None]:
del data['pv']
print(data)

   patient          phylum  value  year  age
0       11      Firmicutes    632  2013   20
1       11  Proteobacteria   1638  2013   20
2       13  Actinobacteria    569  2013   20
3       11   Bacteroidetes     14  2013   20
4       12      Firmicutes    433  2013   45
5       12  Proteobacteria      0  2013   45
6       12  Actinobacteria    754  2013   45
7       12   Bacteroidetes    555  2013   45


Or the `.drop()` method.

In [None]:
data.drop('year', axis=1, inplace=True) # again we need to use the inplace argument
print(data)

   patient          phylum  value  age
0       11      Firmicutes    632   20
1       11  Proteobacteria   1638   20
2       13  Actinobacteria    569   20
3       11   Bacteroidetes     14   20
4       12      Firmicutes    433   45
5       12  Proteobacteria      0   45
6       12  Actinobacteria    754   45
7       12   Bacteroidetes    555   45
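
If we prefer not to modify `data` in place, `.drop()` without `inplace=True` returns a new `DataFrame`. The `columns=` and `index=` keywords used below are an alternative to `axis`. A minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({'patient': [11, 12], 'value': [632, 433], 'year': [2013, 2013]})

trimmed = data.drop(columns=['year'])  # new DataFrame without the column
without_row = data.drop(index=[0])     # rows are dropped by index label

print(trimmed)
print(list(data.columns))              # the original still has 'year'
```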


To transform a `DataFrame` to a NumPy array we use the `.values` attribute.

In [None]:
va = data[['value', 'age']]
print(va)
print()
va_numpy = va.values
print(va_numpy)
print(va_numpy.shape)
print(type(va_numpy))

   value  age
0    632   20
1   1638   20
2    569   20
3     14   20
4    433   45
5      0   45
6    754   45
7    555   45

[[ 632   20]
 [1638   20]
 [ 569   20]
 [  14   20]
 [ 433   45]
 [   0   45]
 [ 754   45]
 [ 555   45]]
(8, 2)
<class 'numpy.ndarray'>
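
The `to_numpy()` method does the same conversion and is the form recommended in the pandas documentation. The sketch below also shows what happens when the selected columns do not share a numeric type.

```python
import pandas as pd

data = pd.DataFrame({'value': [632, 1638], 'age': [20, 20],
                     'phylum': ['Firmicutes', 'Proteobacteria']})

numeric = data[['value', 'age']].to_numpy()  # homogeneous integer array
mixed = data.to_numpy()                      # numbers and strings together: dtype object

print(numeric.dtype, numeric.shape)
print(mixed.dtype)
```

Selecting only the numeric columns before converting avoids ending up with an `object` array, which is slow for numerical work.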


###Loading data



In [None]:
mb = pd.read_csv("/content/drive/My Drive/ICT - AI/Study Units/ARI5102 - Data Analysis Techniques/Tutorials/microbiome.csv")
shape = mb.shape
print(shape)

(75, 4)


In [None]:
mb.head() # it prints the first 5 rows of the dataframe

        Taxon  Patient  Tissue  Stool
0  Firmicutes        1     632    305
1  Firmicutes        2     136   4182
2  Firmicutes        3    1174    703
3  Firmicutes        4     408   3946
4  Firmicutes        5     831   8605


In [None]:
mb.tail() # it prints the last 5 rows of the dataframe

    Taxon  Patient  Tissue  Stool
70  Other       11     203      6
71  Other       12     392      6
72  Other       13      28     25
73  Other       14      12     22
74  Other       15     305     32


Notice that `read_csv` automatically considered the first row in the file to be a header row.

We can override default behavior by customizing the argument `header`.

In [None]:
mb_no_header = pd.read_csv("/content/drive/My Drive/ICT - AI/Study Units/ARI5102 - Data Analysis Techniques/Tutorials/microbiome.csv", header=None)
print(mb_no_header.head())

            0        1       2      3
0       Taxon  Patient  Tissue  Stool
1  Firmicutes        1     632    305
2  Firmicutes        2     136   4182
3  Firmicutes        3    1174    703
4  Firmicutes        4     408   3946


If we have sections of data that we do not wish to import (for example, known bad data), we can pass the line numbers to skip to the `skiprows` argument. Lines are counted from 0, with the header as line 0.

In [None]:
mb_skip = pd.read_csv("/content/drive/My Drive/ICT - AI/Study Units/ARI5102 - Data Analysis Techniques/Tutorials/microbiome.csv", skiprows=[2,3,4])
mb_skip.head()

        Taxon  Patient  Tissue  Stool
0  Firmicutes        1     632    305
1  Firmicutes        5     831   8605
2  Firmicutes        6     693     50
3  Firmicutes        7     718    717
4  Firmicutes        8     173     33


Conversely, if we only want to import a small number of rows from, say, a very large data file, we can use `nrows`.

In [None]:
mb_4rows = pd.read_csv("/content/drive/My Drive/ICT - AI/Study Units/ARI5102 - Data Analysis Techniques/Tutorials/microbiome.csv", nrows=4)
print(mb_4rows)

        Taxon  Patient  Tissue  Stool
0  Firmicutes        1     632    305
1  Firmicutes        2     136   4182
2  Firmicutes        3    1174    703
3  Firmicutes        4     408   3946
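
The `read_csv` options above can be tried without access to the Drive file by reading from an in-memory string. The sketch below is an assumption-laden stand-in: `csv_text` reproduces only the first five data rows shown earlier, not the real `microbiome.csv`.

```python
import io
import pandas as pd

# Stand-in for the first lines of the CSV file (header + 5 rows).
csv_text = """Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,1174,703
Firmicutes,4,408,3946
Firmicutes,5,831,8605
"""

default = pd.read_csv(io.StringIO(csv_text))                     # first line becomes the header
no_header = pd.read_csv(io.StringIO(csv_text), header=None)      # header line is treated as data
skipped = pd.read_csv(io.StringIO(csv_text), skiprows=[2, 3, 4]) # skip file lines 2, 3 and 4
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)          # read only two data rows

print(default.shape, no_header.shape)
print(skipped)
print(first_two)
```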


Sometimes the data contain missing values


In [None]:
mb = pd.read_csv("/content/drive/My Drive/ICT - AI/Study Units/ARI5102 - Data Analysis Techniques/Tutorials/microbiome_missing.csv", nrows=10)
print(mb)

        Taxon  Patient  Tissue   Stool
0  Firmicutes        1   632.0   305.0
1  Firmicutes        2   136.0  4182.0
2  Firmicutes        3     NaN   703.0
3  Firmicutes        4   408.0  3946.0
4  Firmicutes        5   831.0  8605.0
5  Firmicutes        6   693.0    50.0
6  Firmicutes        7   718.0   717.0
7  Firmicutes        8   173.0    33.0
8  Firmicutes        9   228.0     NaN
9  Firmicutes       10   162.0  3196.0


Again, we can use the `.isnull()` method to detect the null values.


In [None]:
print(mb.isnull())

   Taxon  Patient  Tissue  Stool
0  False    False   False  False
1  False    False   False  False
2  False    False    True  False
3  False    False   False  False
4  False    False   False  False
5  False    False   False  False
6  False    False   False  False
7  False    False   False  False
8  False    False   False   True
9  False    False   False  False
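
On larger tables the full boolean frame is hard to read, so it helps to summarise it. The following sketch uses a small hand-made frame (not the real file); `missing_per_column`, `rows_with_missing` and `complete_rows` are illustrative names.

```python
import numpy as np
import pandas as pd

mb = pd.DataFrame({'Taxon': ['Firmicutes'] * 4,
                   'Patient': [1, 2, 3, 4],
                   'Tissue': [632.0, 136.0, np.nan, 408.0],
                   'Stool': [305.0, 4182.0, 703.0, np.nan]})

missing_per_column = mb.isnull().sum()           # NaN count for each column
rows_with_missing = mb[mb.isnull().any(axis=1)]  # rows with at least one NaN
complete_rows = mb.dropna()                      # drop every row containing a NaN

print(missing_per_column)
print(rows_with_missing)
print(complete_rows)
```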


To fill the null values we can use the `.fillna()` method.


In [None]:
mb_nan_to_fixed = mb.fillna(0) #it fills the null values with zeros
print(mb_nan_to_fixed)


        Taxon  Patient  Tissue   Stool
0  Firmicutes        1   632.0   305.0
1  Firmicutes        2   136.0  4182.0
2  Firmicutes        3     0.0   703.0
3  Firmicutes        4   408.0  3946.0
4  Firmicutes        5   831.0  8605.0
5  Firmicutes        6   693.0    50.0
6  Firmicutes        7   718.0   717.0
7  Firmicutes        8   173.0    33.0
8  Firmicutes        9   228.0     0.0
9  Firmicutes       10   162.0  3196.0


In [76]:
mb_nan_to_fixed_per_column = mb.fillna(value={'Tissue':100, 'Stool':25}) # it fills the null values of the Tissue column with 100
print(mb_nan_to_fixed_per_column)                                        # and the null values of the Stool column with 25

        Taxon  Patient  Tissue   Stool
0  Firmicutes        1   632.0   305.0
1  Firmicutes        2   136.0  4182.0
2  Firmicutes        3   100.0   703.0
3  Firmicutes        4   408.0  3946.0
4  Firmicutes        5   831.0  8605.0
5  Firmicutes        6   693.0    50.0
6  Firmicutes        7   718.0   717.0
7  Firmicutes        8   173.0    33.0
8  Firmicutes        9   228.0    25.0
9  Firmicutes       10   162.0  3196.0


In [77]:
mb_fillna_based_on_data = mb.fillna(value={'Tissue':mb['Tissue'].median(), 'Stool':mb['Stool'].mean()}) # it fills the null values of the Tissue column with the median value of the column
print(mb_fillna_based_on_data)                                                                          # and the null values of the Stool column with the mean value of the column

        Taxon  Patient  Tissue        Stool
0  Firmicutes        1   632.0   305.000000
1  Firmicutes        2   136.0  4182.000000
2  Firmicutes        3   408.0   703.000000
3  Firmicutes        4   408.0  3946.000000
4  Firmicutes        5   831.0  8605.000000
5  Firmicutes        6   693.0    50.000000
6  Firmicutes        7   718.0   717.000000
7  Firmicutes        8   173.0    33.000000
8  Firmicutes        9   228.0  2415.222222
9  Firmicutes       10   162.0  3196.000000
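
When every numeric column should be filled with its own statistic, `fillna` also accepts a `Series` indexed by column name, which avoids writing the dictionary by hand. A minimal sketch on a small hand-made frame (the values are illustrative):

```python
import numpy as np
import pandas as pd

mb = pd.DataFrame({'Taxon': ['Firmicutes'] * 4,
                   'Tissue': [632.0, 136.0, np.nan, 408.0],
                   'Stool': [305.0, np.nan, 703.0, 3946.0]})

column_medians = mb[['Tissue', 'Stool']].median()  # Series indexed by column name
filled = mb.fillna(column_medians)                 # each column is filled with its own median

print(column_medians)
print(filled)
```

Like the calls above, this returns a new `DataFrame`; `mb` itself still contains the missing values.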
