## **Data Analysis in Pandas**

### Introduction to Python Libraries

A Python **library** is a collection of modules that we can use in a program for specific operations. It is pre-written software that you can reuse rather than having to write the code yourself from scratch.

### **Python Libraries for Data Science**

#### **Scientific Computation**

Scientific computing uses tools, techniques and theories to solve mathematical problems in science and engineering. As a data scientist you will be required to convert data into an easy-to-process format. Data may become too large to be processed efficiently with Python's native lists and dictionaries and its built-in methods. The following libraries add scientific computation abilities to Python for working efficiently with larger data sets.

##### **NumPy**

**NumPy** (Numerical Python) is the most fundamental package used for scientific computation in Python. It provides a lot of useful functionality for mathematical operations on vectors and matrices. The library performs these operations on the NumPy array data type, which enhances performance and speeds up execution compared with plain Python lists.
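
As a minimal sketch of what array-based operations look like, the block below applies arithmetic to whole arrays at once and multiplies a small matrix by a vector. The variable names and values are made up for illustration.

```python
import numpy as np

# Two small vectors stored as NumPy arrays (illustrative values)
heights_m = np.array([1.60, 1.75, 1.82])
weights_kg = np.array([55.0, 70.0, 90.0])

# Element-wise arithmetic applies to the whole array at once, no explicit loop
bmi = weights_kg / heights_m ** 2

# A 2 x 2 matrix and a matrix-vector product using the @ operator
matrix = np.array([[1, 2], [3, 4]])
vector = np.array([10, 20])
product = matrix @ vector

print(bmi.round(1))
print(product)
```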

##### **SciPy**

The **SciPy** library is mainly used for scientific and technical computing. It is built on top of NumPy, provides efficient numerical routines and comes packaged with a number of specialised submodules. Some of the submodules which commonly apply to Data Science experiments include:

- <font color = 'red'> stats </font> : Statistical Functions
- <font color = 'red'> linalg </font> : Linear Algebra Routines
- <font color = 'red'> optimize </font> : Optimization algorithms, including linear programming

##### **Statsmodels**

**Statsmodels** lets users explore data by estimating statistical models and performing statistical tests and analysis. It provides a comprehensive set of descriptive statistics. Its diagnostics are useful when checking linear regression, robust linear models and time series models, among others, across a range of estimators.

##### **Pandas**

**Pandas** is a Python package designed to work with tabular (relational) data. It replicates much of the functionality of a relational database in a straightforward and easy-to-grasp way. Pandas is a great tool for data cleaning.

The two main data structures in the pandas library are:
1. Series - one-dimensional data structure
2. DataFrame - two-dimensional data structure

#### **Data Visualization**

One of the common tasks you will be required to do as a Data Scientist is to create visualizations. Python has good library support for data visualization, from plotting routine charts in Matplotlib to developing graphical dashboards in Plotly.

##### **Matplotlib**

The **Matplotlib** library is tailored for generating simple yet powerful visualizations. You can make just about any visualization with Matplotlib, including:
1. Line Plots
2. Scatter Plots
3. Bar Charts
4. Histograms
5. Pie charts

Almost everything in Matplotlib is customizable, from labels and grids to legends.

##### **Seaborn**

**Seaborn** is built on top of Matplotlib and specifically targets statistical data visualization. Because it extends Matplotlib, it can address two common complaints about it: the default look of plots and the parameter defaults. Since Seaborn is an extension of Matplotlib, if you know Matplotlib you already have most of Seaborn down.

#### **Machine Learning**

##### **Scikit-Learn**

In machine learning, one of the most used libraries is **Scikit-Learn** (sometimes abbreviated as sklearn). The package builds on the mathematical operations of NumPy and SciPy to model and test computational algorithms. The library combines quality code, good documentation, ease of use and high performance, and has become an industry standard for machine learning with Python. Common problems solved by Scikit-Learn's machine learning algorithms include classification, regression, clustering and dimensionality reduction.

You can find an interactive machine learning map below:
**[Here](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)**

![](https://drive.google.com/uc?id=1hV4NFW42ZCLKvQjx3zohhGw_TwTRP-au)

#### **Deep Learning**

##### **TensorFlow**

**TensorFlow** is an open source library which was originally developed by a team of researchers and machine learning engineers at Google to meet the performance demands of training Deep Neural Networks that analyze visual and textual data. The key feature of TensorFlow is its multi-layered system of nodes, which enables quick training of artificial neural networks on big data. Google has used the library for tasks such as voice recognition and real-time object recognition.

##### **Keras**

**Keras** is a user-friendly open source library for building neural networks through a high-level interface. The Keras library is written in Python, and Python developers often find it much easier to start coding neural networks in Keras than directly in TensorFlow. Keras is easy to get started with, well suited to quick prototyping, and highly modular and extensible. Notwithstanding its ease, simplicity and high-level orientation, Keras is still deep and powerful enough for serious modeling.

### **Understanding Pandas Series and DataFrames**

Pandas contains two main data structures:

1. **Pandas Series** - One-dimensional object that can hold any data type, e.g. strings, floats, integers, Python objects etc. Its axis labels (the row names) are called the index. In simple terms, a pandas Series is like an Excel column.

2. **Pandas DataFrame** - Two-dimensional object that represents data in a tabular format, with rows and columns. Different columns can hold different data types.

Pandas Series and Pandas DataFrame are the two main data structures you will work with in the Pandas library. A pandas Series allows you to transform its values, and you can also modify the columns of a pandas DataFrame. Some of the transformations you can make while working with a pandas data structure are dropping columns, converting dates, setting a new index, renaming columns, cleaning columns and many others. We will take a deeper look at these during Data Preparation.
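
As a rough preview of the transformations mentioned above, the sketch below drops a column, renames a column, converts date strings and sets a new index. The DataFrame and its column names (`Student Name`, `joined`, `notes`) are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical records, made up for illustration
students = pd.DataFrame({
    "Student Name": ["Zainab", "Eshe", "Jane"],
    "joined": ["2023-01-15", "2023-02-01", "2023-03-20"],
    "notes": ["", "", ""],
})

# Drop a column that is not needed
students = students.drop(columns=["notes"])

# Rename a column
students = students.rename(columns={"Student Name": "name"})

# Convert date strings to datetime values
students["joined"] = pd.to_datetime(students["joined"])

# Use the name column as the new index (row labels)
students = students.set_index("name")

# Transform the values of a Series: extract the month from each date
join_month = students["joined"].dt.month

print(students)
print(join_month)
```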

Below is what a pandas Series and a pandas DataFrame look like:


In [None]:
# Pandas Series

# import pandas library and alias it as pd

import pandas as pd

# create a list of student names

student_names = ['Zainab', 'Eshe', 'Jane', 'Hellen', 'Hassan']

# convert the list to a pandas series using the function pd.Series()

student_names_series = pd.Series(student_names)

# call the student_names_series variable to preview the pandas series

student_names_series

0    Zainab
1      Eshe
2      Jane
3    Hellen
4    Hassan
dtype: object

In [None]:
# check the data type using the type() function in python

type(student_names_series)

pandas.core.series.Series

In [None]:
# Pandas DataFrame

# import pandas library and alias it as pd

import pandas as pd

# create a dictionary of three students containing their names, age, location
# and the programming languages they use
# assign the dictionary to variable called data

data = {'name': ['Kimani', 'Jemima', 'Wambui'],
        'age': [22, 26, 32],
        'location': ['Kiambu', 'Machakos', 'Kakamega'],
        'prog_lang' : [['html/css', 'java'], ['python', 'scala'], ['golang', 'javascript']]
        }

# convert the dictionary to a pandas DataFrame using the function pd.DataFrame()

data = pd.DataFrame(data)

# call the variable data to preview the DataFrame

data


Unnamed: 0,name,age,location,prog_lang
0,Kimani,22,Kiambu,"[html/css, java]"
1,Jemima,26,Machakos,"[python, scala]"
2,Wambui,32,Kakamega,"[golang, javascript]"


In [None]:
# call the function type() to check the data type of data

type(data)

pandas.core.frame.DataFrame

### **Accessing Data with Pandas**

Pandas makes accessing and manipulating data in Python simple and efficient. We will dig into the various methods for accessing data from our pandas Series and DataFrame.

Let's go ahead and import pandas first, making sure to alias it as pd.

In [None]:
# import pandas

import pandas as pd

We will use the breast cancer dataset that ships with the Scikit-Learn library. Do not worry about the code; it is just there to make sure you have access to the breast cancer dataset.

In [None]:
# import necessary libraries

from sklearn.datasets import load_breast_cancer

# load the breast cancer data

breast_cancer_data = load_breast_cancer()

# create a DataFrame from the data and assign column names to feature names

df = pd.DataFrame(breast_cancer_data.data, columns = breast_cancer_data.feature_names)

# display the DataFrame

df

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


Great! Now say you only wanted to see a few lines of data based on certain constraints. This is where accessing data comes in. There are some methods and attributes which make retrieving information from data very easy. Some commonly used methods include:

- **.head()**
- **.tail()**

And attributes:

- **.columns**
- **.index**
- **.dtypes**
- **.shape**

Let's go ahead and take a look at some of the methods:

.head() and .tail() let you preview the first and last rows of the DataFrame. You can pass the number of rows you want; the default is 5. For example:

In [None]:
# first 5 rows of df

df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [None]:
# last 3 rows of df

df.tail(3)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
566,16.6,28.08,108.3,858.1,0.08455,0.1023,0.09251,0.05302,0.159,0.05648,...,18.98,34.12,126.7,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.0782
567,20.6,29.33,140.1,1265.0,0.1178,0.277,0.3514,0.152,0.2397,0.07016,...,25.74,39.42,184.6,1821.0,0.165,0.8681,0.9387,0.265,0.4087,0.124
568,7.76,24.54,47.92,181.0,0.05263,0.04362,0.0,0.0,0.1587,0.05884,...,9.456,30.37,59.16,268.6,0.08996,0.06444,0.0,0.0,0.2871,0.07039


To get a concise summary of the dataset you can use .info()

In [None]:
# display information about the dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
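 14  smoothness error         569 non-null    float64
 15  compactness error        569 non-null    float64
 16  concavity error          569 non-null    float64
 17  concave points error     569 non-null    float64
 18  symmetry error           569 non-null    float64
 19  fractal dimension error  569 non-null    float64
 20  worst radius             569 non-null    float64
 21  worst texture            569 non-null    float64
 22  worst perimeter          569 non-null    float64
 23  worst area               569 non-null    float64
 24  worst smoothness         569 non-null    float64
 25  worst compactness        569 non-null    float64
 26  worst concavity          569 non-null    float64
 27  worst concave points     569 non-null    float64
 28  worst symmetry           569 non-null    float64
 29  worst fractal dimension  569 non-null    float64
dtypes: float64(30)
memory usage: 133.5 KB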

Some attributes

Using **.index** you can access the index or row labels of the DataFrame

In [None]:
# access the index of the DataFrame

df.index

RangeIndex(start=0, stop=569, step=1)

Using **.columns** you can access the column names of the DataFrame

In [None]:
# access column names of the DataFrame

df.columns

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'],
      dtype='object')

**.dtypes** returns the data type of each column

In [None]:
# access data types of the columns

df.dtypes

mean radius                float64
mean texture               float64
mean perimeter             float64
mean area                  float64
mean smoothness            float64
mean compactness           float64
mean concavity             float64
mean concave points        float64
mean symmetry              float64
mean fractal dimension     float64
radius error               float64
texture error              float64
perimeter error            float64
area error                 float64
smoothness error           float64
compactness error          float64
concavity error            float64
concave points error       float64
symmetry error             float64
fractal dimension error    float64
worst radius               float64
worst texture              float64
worst perimeter            float64
worst area                 float64
worst smoothness           float64
worst compactness          float64
worst concavity            float64
worst concave points       float64
worst symmetry             float64
worst fractal dimension    float64
dtype: object

Using .shape returns the number of rows and columns of the DataFrame in a tuple

In [None]:
# retrieve the dimensions of the DataFrame

df.shape

(569, 30)

#### Selecting DataFrame Information

There are two important indexers for selecting DataFrame information that we have not covered yet:

1. **iloc** - integer-location based indexing, i.e. selection by position

2. **loc** - label based indexing, which has two cases:
- selecting by label/index
- selecting with a boolean / conditional lookup

You can use **.iloc** to select single rows. Positions start at 0, so to select the fourth row you use **.iloc[3]**

In [None]:
# select the 4th row in the DataFrame

df.iloc[3]

mean radius                 11.420000
mean texture                20.380000
mean perimeter              77.580000
mean area                  386.100000
mean smoothness              0.142500
mean compactness             0.283900
mean concavity               0.241400
mean concave points          0.105200
mean symmetry                0.259700
mean fractal dimension       0.097440
radius error                 0.495600
texture error                1.156000
perimeter error              3.445000
area error                  27.230000
smoothness error             0.009110
compactness error            0.074580
concavity error              0.056610
concave points error         0.018670
symmetry error               0.059630
fractal dimension error      0.009208
worst radius                14.910000
worst texture               26.500000
worst perimeter             98.870000
worst area                 567.700000
worst smoothness             0.209800
worst compactness            0.866300
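worst concavity              0.686900
worst concave points         0.257500
worst symmetry               0.663800
worst fractal dimension      0.173000
Name: 3, dtype: float64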

In [None]:
# select the 10th row in the DataFrame

df.iloc[9]

mean radius                 12.460000
mean texture                24.040000
mean perimeter              83.970000
mean area                  475.900000
mean smoothness              0.118600
mean compactness             0.239600
mean concavity               0.227300
mean concave points          0.085430
mean symmetry                0.203000
mean fractal dimension       0.082430
radius error                 0.297600
texture error                1.599000
perimeter error              2.039000
area error                  23.940000
smoothness error             0.007149
compactness error            0.072170
concavity error              0.077430
concave points error         0.014320
symmetry error               0.017890
fractal dimension error      0.010080
worst radius                15.090000
worst texture               40.680000
worst perimeter             97.650000
worst area                 711.400000
worst smoothness             0.185300
worst compactness            1.058000
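worst concavity              1.105000
worst concave points         0.221000
worst symmetry               0.436600
worst fractal dimension      0.207500
Name: 9, dtype: float64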

You can use a colon to select several rows. Note that you will use a structure **iloc[a:b]** where the row with index a will be included in the selection while the row with index b will be excluded

In [None]:
# retrieve rows from index 4 to 6 using integer location

df.iloc[4:7]

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
5,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,0.2087,0.07613,...,15.47,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244
6,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,0.1794,0.05742,...,22.88,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368


You can also perform column and row selection at once:

In [None]:
# retrieve a subset of rows (positions 3 to 8)
# and columns (positions 0 to 4)
# using integer location based indexing

df.iloc[3:9, 0:5]

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness
3,11.42,20.38,77.58,386.1,0.1425
4,20.29,14.34,135.1,1297.0,0.1003
5,12.45,15.7,82.57,477.1,0.1278
6,18.25,19.98,119.6,1040.0,0.09463
7,13.71,20.83,90.2,577.9,0.1189
8,13.0,21.82,87.5,519.8,0.1273
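
Besides slices, **.iloc** also accepts negative positions, lists of positions and a single row/column pair. The sketch below shows these on a small made-up DataFrame; the name `table` and its values are illustrative only.

```python
import pandas as pd

# Small illustrative frame (values are made up)
table = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [10, 20, 30, 40],
    "c": [100, 200, 300, 400],
})

# Negative positions count from the end: -1 is the last row
last_row = table.iloc[-1]

# Lists pick out specific rows and columns that need not be adjacent
picked = table.iloc[[0, 2], [0, 2]]

# A single row position and column position return one value
single = table.iloc[1, 1]

print(last_row)
print(picked)
print(single)
```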


**.loc**

**.loc** -- label based indexing

You can use .loc to select data by row label and column name. For example:

In [None]:
# retrieve the mean area column for all the rows using label based indexing

df.loc[:, 'mean area']

0      1001.0
1      1326.0
2      1203.0
3       386.1
4      1297.0
        ...  
564    1479.0
565    1261.0
566     858.1
567    1265.0
568     181.0
Name: mean area, Length: 569, dtype: float64

In [None]:
# retrieve the mean area column for the rows labelled 6 up to 12 using label based indexing

df.loc[6:12, 'mean area']

6     1040.0
7      577.9
8      519.8
9      475.9
10     797.8
11     781.0
12    1123.0
Name: mean area, dtype: float64

Note that when slicing by label with **.loc**, both the start and the stop labels are included in the selection, unlike typical Python slices, where the stop is excluded. The pandas documentation gives the main reason: with labels it is often not possible to easily determine the "successor" of a given label, so an exclusive stop would be awkward to express.
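
The difference is easiest to see on an index whose labels are not integers. The following sketch uses a small made-up Series (`areas`, with string labels) to compare a position-based slice with a label-based slice.

```python
import pandas as pd

# Small illustrative Series with string labels instead of the default 0..n-1
areas = pd.Series([1001.0, 1326.0, 1203.0, 386.1, 1297.0],
                  index=["a", "b", "c", "d", "e"])

# Position-based slice: the stop position (3) is excluded, so 2 rows come back
by_position = areas.iloc[1:3]

# Label-based slice: the stop label ("d") is included, so 3 rows come back
by_label = areas.loc["b":"d"]

print(by_position)
print(by_label)
```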

**.loc** -- boolean indexing using .loc

Sometimes you'd like to select certain rows in your dataset based on the value of a certain variable. Imagine you'd like to create a new DataFrame that only contains the breast cancer records with a mean area below 500. This can be done as follows:

In [None]:
# select the rows whose mean area is below 500

df.loc[df['mean area'] < 500]

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
3,11.420,20.38,77.58,386.1,0.14250,0.28390,0.241400,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.68690,0.25750,0.6638,0.17300
5,12.450,15.70,82.57,477.1,0.12780,0.17000,0.157800,0.08089,0.2087,0.07613,...,15.470,23.75,103.40,741.6,0.17910,0.52490,0.53550,0.17410,0.3985,0.12440
9,12.460,24.04,83.97,475.9,0.11860,0.23960,0.227300,0.08543,0.2030,0.08243,...,15.090,40.68,97.65,711.4,0.18530,1.05800,1.10500,0.22100,0.4366,0.20750
21,9.504,12.44,60.34,273.9,0.10240,0.06492,0.029560,0.02076,0.1815,0.06905,...,10.230,15.66,65.13,314.9,0.13240,0.11480,0.08867,0.06227,0.2450,0.07773
31,11.840,18.70,77.93,440.6,0.11090,0.15160,0.121800,0.05182,0.2301,0.07799,...,16.820,28.12,119.40,888.7,0.16370,0.57750,0.69560,0.15460,0.4761,0.14020
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
556,10.160,19.59,64.73,311.7,0.10030,0.07504,0.005025,0.01116,0.1791,0.06331,...,10.650,22.88,67.88,347.3,0.12650,0.12000,0.01005,0.02232,0.2262,0.06742
557,9.423,27.88,59.26,271.3,0.08123,0.04971,0.000000,0.00000,0.1742,0.06059,...,10.490,34.24,66.50,330.6,0.10730,0.07158,0.00000,0.00000,0.2475,0.06969
559,11.510,23.93,74.52,403.5,0.09261,0.10210,0.111200,0.04105,0.1388,0.06570,...,12.480,37.16,82.28,474.2,0.12980,0.25170,0.36300,0.09653,0.2112,0.08732
561,11.200,29.37,70.67,386.0,0.07449,0.03558,0.000000,0.00000,0.1060,0.05502,...,11.920,38.30,75.19,439.6,0.09267,0.05494,0.00000,0.00000,0.1566,0.05905
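
Conditions can also be combined. The sketch below works on a small made-up frame (`sample`) that reuses two column names from the dataset above; each condition is wrapped in parentheses and joined with `&` (and) or `|` (or).

```python
import pandas as pd

# Small illustrative frame reusing two column names from the dataset above
sample = pd.DataFrame({
    "mean radius": [17.99, 11.42, 12.45, 9.504],
    "mean area": [1001.0, 386.1, 477.1, 273.9],
})

# Each condition goes in parentheses; & means "and", | means "or"
small_and_narrow = sample.loc[(sample["mean area"] < 500) & (sample["mean radius"] < 12)]
either = sample.loc[(sample["mean area"] > 1000) | (sample["mean radius"] < 10)]

# A boolean mask can be combined with a column selection in the same call
areas_only = sample.loc[sample["mean area"] < 500, "mean area"]

print(small_and_narrow)
print(either)
print(areas_only)
```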


### **Importing Data with Pandas**

As mentioned before, Pandas is a very popular library for Data Wrangling/Data Cleaning. It is particularly optimized to work with tabular (relational) data. Here we will learn how to import data as a Pandas DataFrame object, how to access and manipulate the data and how to export DataFrames to some common file formats.

When importing pandas, it is standard to import it under the alias pd.

There are a few main functions for importing data into a Pandas DataFrame including:

- **pd.read_csv()**
- **pd.read_excel()**
- **pd.read_json()**
- **pd.DataFrame.from_dict()**

Most of these functions are quite straightforward. We use read_csv() for csv files, read_excel() for Excel files (both the old .xls and the new .xlsx formats) and read_json() for json files. That said, there are a few nuances you should be aware of. read_csv() can be used for any plain-text delimited file, including pipe-delimited and tab-separated files, as long as you pass the delimiter through the **sep** parameter.
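
As a small sketch of reading other delimited formats, the block below parses pipe-delimited and tab-separated text with read_csv(). The text is held in memory with `io.StringIO` so the example does not depend on any file; the contents are made up.

```python
import io
import pandas as pd

# In-memory text stands in for files on disk (contents are made up)
pipe_text = "name|age\nKimani|22\nJemima|26\n"
tab_text = "name\tage\nKimani\t22\nJemima\t26\n"

# sep tells read_csv which character separates the fields
pipe_df = pd.read_csv(io.StringIO(pipe_text), sep="|")
tab_df = pd.read_csv(io.StringIO(tab_text), sep="\t")

print(pipe_df)
print(tab_df.equals(pipe_df))
```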

Let's take a look at an example.

You can access the data to be used **[Here](https://github.com/atienosonia/PwaniTeknowGirls/blob/master/wineq-white.csv)**

Click the link to open the file on GitHub. On the right-hand side, next to the Raw button, there is a download icon (an arrow pointing downwards). Click it and note where the file is saved. Then move back to Google Colab. On the far left there is a folder icon; click it. Among the icons displayed, one is labelled upload to session storage and has an upward arrow. Click it, navigate to where you downloaded the csv file and upload it.

In [None]:
# import necessary libraries

import pandas as pd

# load the data, make sure to specify the correct path
# to get the file path just right click on the file uploaded
# you will see an option to copy path, click on that then
# paste the path onto the function pd.read_csv()

data = pd.read_csv('/content/wineq-white.csv')

# look at the first ten rows

data.head(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
5,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
6,6.2,0.32,0.16,7.0,0.045,30.0,136.0,0.9949,3.18,0.47,9.6,6
7,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
8,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
9,8.1,0.22,0.43,1.5,0.044,28.0,129.0,0.9938,3.22,0.45,11.0,6


##### **Header**
When working with data, you may encounter files where the first row does not contain the column names: the file may have no header row at all, or the names may sit on a later row after some title lines. In such cases you can tell read_csv() where the column names are through its **header** parameter, so that the data is interpreted correctly.
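
The two situations can be sketched as follows. The text snippets are made up and held in memory with `io.StringIO`; the column names passed to `names` are borrowed from the wine dataset purely for illustration.

```python
import io
import pandas as pd

# Case 1: a file with no header row at all (made-up contents)
no_header = "7.0,0.27,0.36\n6.3,0.30,0.34\n"
df_a = pd.read_csv(
    io.StringIO(no_header),
    header=None,  # do not treat the first row as column names
    names=["fixed acidity", "volatile acidity", "citric acid"],
)

# Case 2: the column names sit on the second line, after a title line
title_first = (
    "White wine measurements\n"
    "fixed acidity,volatile acidity\n"
    "7.0,0.27\n"
    "6.3,0.30\n"
)
df_b = pd.read_csv(io.StringIO(title_first), header=1)  # row 1 (0-based) holds the names

print(df_a)
print(df_b)
```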

##### **Selecting Specific Columns**

You can select specific columns if you only want to load certain features

In [None]:
# import the data file with specific columns
# use the same wine dataset

df = pd.read_csv('/content/wineq-white.csv', usecols = [0, 1, 2, 3])

# preview the data

df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar
0,7.0,0.27,0.36,20.7
1,6.3,0.3,0.34,1.6
2,8.1,0.28,0.4,6.9
3,7.2,0.23,0.32,8.5
4,7.2,0.23,0.32,8.5


Pandas read_csv() has quite a number of parameters you can pass to make sure your data gets loaded in the right format and you have all the information you need. You can check the documentation **[Here](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)** to get a good grasp of the parameters and when to use them.
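
To give a feel for a few of these parameters, the sketch below selects columns by name with `usecols`, limits the number of rows with `nrows` and overrides an inferred type with `dtype`. The CSV text is a made-up, in-memory stand-in for a file.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = (
    "fixed acidity,volatile acidity,citric acid,quality\n"
    "7.0,0.27,0.36,6\n"
    "6.3,0.30,0.34,6\n"
    "8.1,0.28,0.40,6\n"
)

subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["fixed acidity", "quality"],  # pick columns by name
    nrows=2,                               # read only the first two data rows
    dtype={"quality": "float64"},          # override the inferred integer type
)

print(subset)
print(subset.dtypes)
```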

##### **Saving Data**

Once you have loaded your data and done some cleaning on it, you may want to export it. We use the **.to_csv()** or **.to_excel()** methods of any DataFrame object to do so.

In [None]:
# write data to a csv file

df.to_csv('new_saved_file.csv', index = False)

In [None]:
# write data to an excel file

df.to_excel('new_saved_file.xlsx')
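
The effect of `index = False` is easiest to see by writing a file both ways and reading it back. The sketch below does this in a temporary folder so nothing is left behind; the frame and file names are made up for illustration.

```python
import os
import tempfile
import pandas as pd

frame = pd.DataFrame({"name": ["Kimani", "Jemima"], "age": [22, 26]})

with tempfile.TemporaryDirectory() as folder:
    with_index = os.path.join(folder, "with_index.csv")
    without_index = os.path.join(folder, "without_index.csv")

    frame.to_csv(with_index)                  # row labels are written as a first column
    frame.to_csv(without_index, index=False)  # only the data columns are written

    reloaded_with = pd.read_csv(with_index)
    reloaded_without = pd.read_csv(without_index)

print(reloaded_with.columns.tolist())
print(reloaded_without.columns.tolist())
```

When the index is written, reading the file back produces an extra column that pandas names `Unnamed: 0`, which is why `index = False` is often preferred for a default integer index.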

### **Statistical Methods in Pandas**

The first step when working with a new dataset is always to understand what the dataset is made up of. The pandas DataFrame class contains two built-in methods that make this very easy for us.

Using <font color = 'red'>df.info()</font>

**df.info()** provides us with summary metadata about our DataFrame, that is, data about our dataset, such as how many rows and columns it contains and the data types stored in each column.

Let's demonstrate this using the Titanic dataset. You can get the dataset from **[Here](https://github.com/atienosonia/PwaniTeknowGirls/blob/master/train.csv)**

Follow the same instructions used for the wine quality data to load the Titanic data into Google Colab.


In [None]:
# import necessary libraries

import pandas as pd

# load the data

df_2 = pd.read_csv('/content/train.csv')

# display the dataframe information

df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The output above gives us a lot of information about the characteristics of the DataFrame. Examine it and take note of the most important things it tells us, such as:
- The number of columns and rows in the DataFrame
- The data type of the data each column contains
- How many values each column contains (NaNs are not counted)
- The memory footprint of the data

This sort of information is called **metadata** since it is data about our data.
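
The same facts can also be retrieved as values you can work with, rather than as printed text. The sketch below uses a small made-up frame (`passengers`) with a few missing entries; the column names echo the Titanic data but the values are invented.

```python
import numpy as np
import pandas as pd

# Small made-up frame with a few missing values
passengers = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": ["C85", None, None, "C123"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Number of missing entries per column
missing_per_column = passengers.isna().sum()

# Number of non-missing entries per column (the "Non-Null Count" in .info())
non_null_per_column = passengers.count()

# Number of rows and columns
n_rows, n_columns = passengers.shape

print(missing_per_column)
print(non_null_per_column)
print(n_rows, n_columns)
```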

Using <font color = 'red'>df.describe()</font>

While conducting Data Analysis, it's usually important to dig into the summary statistics of the dataset, and get a feel for the data each column contains. Rather than force us to deal with the tedium of doing this individually for every column, Pandas DataFrames provide the handy **df.describe()** method which calculates the basic summary statistics for each column.

See the example in the cell below.

In [None]:
# generate the descriptive statistics for the DataFrame df_2

df_2.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


The method gives us very relevant information such as :
- a **count** of the number of values in each column, making it easy to identify columns with missing values
- The mean and standard deviation of each column
- The minimum and maximum values found in each column
- The median (50%) and quartile values (25% & 75%) for each column

When you want a quick feel for your dataset before proceeding to Exploratory Data Analysis, the describe method is a good place to start.

##### **Calculating Individual Column Statistics**

Pandas DataFrame and Series objects come with a plethora of built-in methods to quickly calculate summary statistics. When we need an individual statistic for a column, we can compute it directly. See the code blocks below:

In [None]:
# calculate mean(average) values of numeric columns in the DataFrame df_2

df_2.mean(numeric_only = True)

PassengerId    446.000000
Survived         0.383838
Pclass           2.308642
Age             29.699118
SibSp            0.523008
Parch            0.381594
Fare            32.204208
dtype: float64

In [None]:
# calculate mean (average) of fare values in the df_2 DataFrame

df_2['Fare'].mean()

32.204207968574636

In [None]:
# calculate median value of the age column in df_2 DataFrame

df_2['Age'].median()

28.0

There are many different statistical methods built into Pandas DataFrames -- these are just a few! We will not list all of them, but here are some common ones you'll probably make use of early and often:

- <font color = 'red'> .mode() </font> -- the mode of the column
- <font color = 'red'> .count() </font> -- the number of non-missing entries in a column
- <font color = 'red'> .std() </font> -- the standard deviation for the column
- <font color = 'red'>.var()</font> -- the variance for the column
- <font color = 'red'>.sum()</font> -- the sum of all values in the column
- <font color = 'red'>.cumsum()</font> -- the cumulative sum, where each entry contains the sum of all entries up to and including itself.
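
The methods listed above can be tried on a tiny Series where the results are easy to check by hand. The Series `fares` below is made up for illustration.

```python
import pandas as pd

# Tiny illustrative Series
fares = pd.Series([10.0, 20.0, 20.0, 30.0])

most_common = fares.mode()    # a Series, since several values can tie
n_entries = fares.count()     # number of non-missing entries
total = fares.sum()
spread = fares.std()          # sample standard deviation (ddof=1 by default)
variance = fares.var()        # sample variance (ddof=1 by default)
running_total = fares.cumsum()

print(most_common.tolist())
print(n_entries, total)
print(spread, variance)
print(running_total.tolist())
```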

##### **Summary Statistics for Categorical Data**


We cannot calculate most summary statistics on columns that contain non-numeric data -- there's no way for us to find the mean of the letters in the Embarked column (embarked - Port of Embarkation [C = Cherbourg; Q = Queenstown; S = Southampton]), for instance. However, there are some summary statistics we can use to help us better understand our categorical columns.

See the examples in the cell below:

In [None]:
# retrieve unique values in the Embarked column from DataFrame df_2

df_2['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [None]:
# count the frequency of each unique values in the embarked column

df_2['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64
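
A few related calls can round out the picture for a categorical column. The sketch below uses a short made-up Series named `embarked`; it counts distinct values, reports proportions, includes missing values in the counts and summarises the column with describe().

```python
import numpy as np
import pandas as pd

# Short made-up Series with one missing value
embarked = pd.Series(["S", "C", "S", "Q", np.nan, "S"], name="Embarked")

# Number of distinct non-missing values
n_distinct = embarked.nunique()

# Proportions instead of raw counts (missing values are excluded)
shares = embarked.value_counts(normalize=True)

# Raw counts with missing values included
with_missing = embarked.value_counts(dropna=False)

# For non-numeric data, describe() reports count, unique, top and freq
summary = embarked.describe()

print(n_distinct)
print(shares)
print(with_missing)
print(summary)
```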

These methods are extremely useful when dealing with categorical data!

<font color = 'red'>.unique()</font> shows us all the unique values contained in the column.

<font color = 'red'>.value_counts()</font> shows us a count for how many times each unique value is present in a dataset, giving us a feel for the distribution of values in the column.