# Pandas Getting Started

-> Pandas is used to analyze, clean, explore, and manipulate data.

## Pandas Installation

In [2]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.




## Importing Pandas

In [3]:
import pandas as pd

###  Checking the pandas version

In [5]:
print(pd.__version__)

1.4.4


## Pandas Series 

In [8]:
num = [2, 3, 5, 7] # list of numbers

series = pd.Series(num) # pandas series

print(series)

0    2
1    3
2    5
3    7
dtype: int64


###  Labelling

-> By default, a pandas Series is indexed by position, starting at 0. The index argument lets you assign your own labels instead.

In [13]:
series_1 = pd.Series(num, index = ['A', 'B', 'C', 'D'])

print(series_1)

A    2
B    3
C    5
D    7
dtype: int64


### Accessing Elements Using Label

In [14]:
print(series_1['B']) # prints the value labelled 'B', which is the second element of series_1

3
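-> Square brackets work for labels, but pandas also offers the explicit accessors loc (by label) and iloc (by position). The snippet below is a small sketch that rebuilds series_1 so it runs on its own; the variable names by_label, by_position and several are only illustrative.

```python
import pandas as pd

# Same data as series_1 above, rebuilt so the snippet is self-contained.
series_1 = pd.Series([2, 3, 5, 7], index=['A', 'B', 'C', 'D'])

by_label = series_1.loc['B']        # label-based lookup
by_position = series_1.iloc[1]      # position-based lookup (0-based)
several = series_1.loc[['A', 'D']]  # a list of labels returns a smaller Series

print(by_label, by_position)
print(several)
```

-> Using loc and iloc makes it clear whether a lookup means "the row labelled B" or "the row at position 1", which matters once labels are themselves integers.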


###  Key/values object as Series

In [15]:
days = {'sunday': 1, 'Monday': 2, 'tuesday': 3, 'wednesday': 4, 'thursday': 5, 'friday': 6, 'saturday': 7}
day_series = pd.Series(days)

print(day_series) # print the Series, not the original dictionary

sunday       1
Monday       2
tuesday      3
wednesday    4
thursday     5
friday       6
saturday     7
dtype: int64
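-> When a Series is built from a dictionary, the keys become the labels. As a small sketch, the index argument can also be used to keep only some of the keys, in the order you list them. The shortened dictionary and the name selected_days are illustrative only.

```python
import pandas as pd

days = {'sunday': 1, 'Monday': 2, 'tuesday': 3}

# Only the keys listed in `index` are kept, in the order given.
selected_days = pd.Series(days, index=['tuesday', 'sunday'])

print(selected_days)
```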


##  Dataframes

-> Data sets in pandas are usually two-dimensional tables; a DataFrame is the object that holds such a table.

In [19]:
data = {'Names': ['Kevin', 'Brian', 'Winney', 'Violet'],
       'Ages': [32, 21, 18, 20]}

data_set = pd.DataFrame(data, index = [1, 2, 3, 4]) # loading data into a DataFrame object.

print(data_set)

    Names  Ages
1   Kevin    32
2   Brian    21
3  Winney    18
4  Violet    20


### Locating Row

-> A DataFrame is like a table with both columns and rows

#### loc

-> loc is a pandas indexer that returns one or more rows selected by their index label

In [29]:
print(data_set.loc[1])

Names    Kevin
Ages        32
Name: 1, dtype: object


In [30]:
print(data_set.loc[[1, 2]])

   Names  Ages
1  Kevin    32
2  Brian    21
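-> Besides whole rows, loc can pick a single cell or filter rows by a condition, and iloc selects by position instead of label. The sketch below rebuilds data_set so it runs on its own; the names age_of_row_2, first_row and older_than_20 are illustrative.

```python
import pandas as pd

data = {'Names': ['Kevin', 'Brian', 'Winney', 'Violet'],
        'Ages': [32, 21, 18, 20]}
data_set = pd.DataFrame(data, index=[1, 2, 3, 4])

age_of_row_2 = data_set.loc[2, 'Ages']                # row label 2, column 'Ages'
first_row = data_set.iloc[0]                          # first row by position, whatever its label
older_than_20 = data_set.loc[data_set['Ages'] > 20]   # boolean mask keeps matching rows

print(age_of_row_2)
print(first_row)
print(older_than_20)
```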


## Loading Files into a DataFrame

In [33]:
df = pd.read_csv('Iris.csv')

print(df)

        CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  \
0    0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900    1  296   
1    0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671    2  242   
2    0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671    2  242   
3    0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622    3  222   
4    0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622    3  222   
..       ...   ...    ...   ...    ...    ...   ...     ...  ...  ...   
501  0.06263   0.0  11.93   0.0  0.573  6.593  69.1  2.4786    1  273   
502  0.04527   0.0  11.93   0.0  0.573  6.120  76.7  2.2875    1  273   
503  0.06076   0.0  11.93   0.0  0.573  6.976  91.0  2.1675    1  273   
504  0.10959   0.0  11.93   0.0  0.573  6.794  89.3  2.3889    1  273   
505  0.04741   0.0  11.93   0.0  0.573  6.030   NaN  2.5050    1  273   

     PTRATIO       B  LSTAT  MEDV  
0       15.3  396.90   4.98  24.0  
1       17.8  396.90   9.14  21.6  
2       17.8  3

## Read CSV Files

-> A simple way to store a large data set is a comma-separated values (CSV) file.
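-> To see what read_csv does without needing a file on disk, here is a minimal sketch that reads CSV text from a string. The CSV content and the names csv_text and small_df are made up for illustration.

```python
import io
import pandas as pd

# A tiny CSV held in a string; the first line becomes the column names.
csv_text = """name,age
Kevin,32
Brian,21
"""

small_df = pd.read_csv(io.StringIO(csv_text))

print(small_df)
print(small_df.dtypes)  # read_csv infers a type for each column
```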


In [37]:
print(df.to_string()) # to_string() is used to print entire DataFrame.

         CRIM     ZN  INDUS  CHAS     NOX     RM    AGE      DIS  RAD  TAX  PTRATIO       B  LSTAT  MEDV
0     0.00632   18.0   2.31   0.0  0.5380  6.575   65.2   4.0900    1  296     15.3  396.90   4.98  24.0
1     0.02731    0.0   7.07   0.0  0.4690  6.421   78.9   4.9671    2  242     17.8  396.90   9.14  21.6
2     0.02729    0.0   7.07   0.0  0.4690  7.185   61.1   4.9671    2  242     17.8  392.83   4.03  34.7
3     0.03237    0.0   2.18   0.0  0.4580  6.998   45.8   6.0622    3  222     18.7  394.63   2.94  33.4
4     0.06905    0.0   2.18   0.0  0.4580  7.147   54.2   6.0622    3  222     18.7  396.90    NaN  36.2
5     0.02985    0.0   2.18   0.0  0.4580  6.430   58.7   6.0622    3  222     18.7  394.12   5.21  28.7
6     0.08829   12.5   7.87   NaN  0.5240  6.012   66.6   5.5605    5  311     15.2  395.60  12.43  22.9
7     0.14455   12.5   7.87   0.0  0.5240  6.172   96.1   5.9505    5  311     15.2  396.90  19.15  27.1
8     0.21124   12.5   7.87   0.0  0.5240  5.631  100.0

###  Max_rows

-> The number of rows pandas displays is controlled by a display option. A DataFrame with more rows than this limit is printed in truncated form.

-> You can check the current limit with the pd.options.display.max_rows setting:

In [38]:
print(pd.options.display.max_rows)

60
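-> If you only want a different limit for one piece of output, pandas provides pd.option_context, which restores the previous value afterwards. A small sketch, using a made-up 100-row frame called numbers:

```python
import pandas as pd

numbers = pd.DataFrame({'n': range(100)})  # hypothetical 100-row frame

default_limit = pd.get_option('display.max_rows')

# Inside the block the limit is 10, so the printed frame is truncated.
with pd.option_context('display.max_rows', 10):
    limit_inside = pd.get_option('display.max_rows')
    truncated_text = repr(numbers)
    print(truncated_text)

limit_after = pd.get_option('display.max_rows')  # back to the earlier value
print(limit_after)
```

-> This avoids changing the setting for the rest of the notebook, which assigning to pd.options.display.max_rows would do.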


###  Increasing the number of rows to display the entire DataFrame

In [40]:
pd.options.display.max_rows = 9999

df = pd.read_csv('Iris.csv')
df

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,,36.2
5,0.02985,0.0,2.18,0.0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
6,0.08829,12.5,7.87,,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9
7,0.14455,12.5,7.87,0.0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1
8,0.21124,12.5,7.87,0.0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5
9,0.17004,12.5,7.87,,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9


#  Analyzing DataFrame

##  Viewing Data

#### head()

-> head() returns the first five rows of the DataFrame by default; pass a number, as in head(10), to get a different count

In [45]:
print(df.head())

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  PTRATIO  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900    1  296     15.3   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671    2  242     17.8   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671    2  242     17.8   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622    3  222     18.7   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622    3  222     18.7   

        B  LSTAT  MEDV  
0  396.90   4.98  24.0  
1  396.90   9.14  21.6  
2  392.83   4.03  34.7  
3  394.63   2.94  33.4  
4  396.90    NaN  36.2  


In [47]:
print(df.head().to_string()) # to_string() prints every column on one line per row, without wrapping

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  PTRATIO       B  LSTAT  MEDV
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900    1  296     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671    2  242     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671    2  242     17.8  392.83   4.03  34.7
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622    3  222     18.7  394.63   2.94  33.4
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622    3  222     18.7  396.90    NaN  36.2
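-> The row count passed to head() can be sketched on a small made-up frame. The frame and the variable names below are illustrative, not part of the data set used above.

```python
import pandas as pd

frame = pd.DataFrame({'value': [10, 20, 30, 40, 50, 60, 70]})  # made-up data

first_five = frame.head()           # default: first 5 rows
first_two = frame.head(2)           # first 2 rows
all_but_last_two = frame.head(-2)   # negative n: everything except the last 2 rows

print(first_two)
print(all_but_last_two)
```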


####  tail()

-> tail() returns the last five rows of the DataFrame by default; pass a number to change how many, as in tail(10) below

In [50]:
print(df.tail(10))

        CRIM   ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  PTRATIO  \
496  0.28960  0.0   9.69   0.0  0.585  5.390  72.9  2.7986    6  391     19.2   
497  0.26838  0.0   9.69   0.0  0.585  5.794  70.6  2.8927    6  391     19.2   
498  0.23912  0.0   9.69   0.0  0.585  6.019  65.3  2.4091    6  391     19.2   
499  0.17783  0.0   9.69   0.0  0.585  5.569  73.5  2.3999    6  391     19.2   
500  0.22438  0.0   9.69   0.0  0.585  6.027  79.7  2.4982    6  391     19.2   
501  0.06263  0.0  11.93   0.0  0.573  6.593  69.1  2.4786    1  273     21.0   
502  0.04527  0.0  11.93   0.0  0.573  6.120  76.7  2.2875    1  273     21.0   
503  0.06076  0.0  11.93   0.0  0.573  6.976  91.0  2.1675    1  273     21.0   
504  0.10959  0.0  11.93   0.0  0.573  6.794  89.3  2.3889    1  273     21.0   
505  0.04741  0.0  11.93   0.0  0.573  6.030   NaN  2.5050    1  273     21.0   

          B  LSTAT  MEDV  
496  396.90  21.14  19.7  
497  396.90  14.10  18.3  
498  396.90  12.92  21.2  


In [52]:
print(df.tail().to_string())

        CRIM   ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  PTRATIO       B  LSTAT  MEDV
501  0.06263  0.0  11.93   0.0  0.573  6.593  69.1  2.4786    1  273     21.0  391.99    NaN  22.4
502  0.04527  0.0  11.93   0.0  0.573  6.120  76.7  2.2875    1  273     21.0  396.90   9.08  20.6
503  0.06076  0.0  11.93   0.0  0.573  6.976  91.0  2.1675    1  273     21.0  396.90   5.64  23.9
504  0.10959  0.0  11.93   0.0  0.573  6.794  89.3  2.3889    1  273     21.0  393.45   6.48  22.0
505  0.04741  0.0  11.93   0.0  0.573  6.030   NaN  2.5050    1  273     21.0  396.90   7.88  11.9


#### info()

-> info() prints a summary of the DataFrame: the index range, each column's name, non-null count and dtype, and the memory usage

In [56]:
df.info() # info() prints the summary itself and returns None, so wrapping it in print() would add a stray "None"

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     486 non-null    float64
 1   ZN       486 non-null    float64
 2   INDUS    486 non-null    float64
 3   CHAS     486 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      486 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    int64  
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    486 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB


-> The info() output above also gives the number of non-null values in each column.


-> In our data, 'CRIM', 'ZN' and several other columns have only 486 non-null values out of 506 rows.

-> This means each of those columns has 20 rows with no value

-> Empty values can be bad when analyzing data, and you should consider removing them.
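-> The missing values that info() hints at can be counted directly with isna(). The sketch below uses a small made-up frame whose column names mirror the data set above; the values and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing values.
sample = pd.DataFrame({
    'CRIM': [0.1, np.nan, 0.3, 0.4],
    'ZN':   [18.0, 0.0, np.nan, np.nan],
    'NOX':  [0.5, 0.4, 0.4, 0.5],
})

missing_per_column = sample.isna().sum()                  # NaN count for each column
rows_with_any_missing = sample.isna().any(axis=1).sum()   # rows with at least one NaN

print(missing_per_column)
print(rows_with_any_missing)
```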


## Data Cleaning in Pandas

-> Data cleaning means fixing bad data in your data set.

-> Bad data could be:

* Empty cells
* Data in wrong format
* Wrong data
* Duplicates
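-> Two of these kinds of bad data can be detected with built-in pandas calls. The sketch below uses a made-up frame called records: duplicated() flags repeated rows, and to_datetime with errors='coerce' turns values in the wrong format into NaT so they can be found.

```python
import pandas as pd

# Made-up records: row 2 has a badly formatted date, row 3 repeats row 1.
records = pd.DataFrame({
    'date': ['2020-12-01', '2020-12-02', 'not a date', '2020-12-02'],
    'pulse': [110, 117, 103, 117],
})

duplicate_rows = records.duplicated()   # True for rows that repeat an earlier row
parsed_dates = pd.to_datetime(records['date'], errors='coerce')  # unparseable text becomes NaT

print(duplicate_rows)
print(parsed_dates)
```

-> Wrong data, such as a value that is valid in format but implausible, usually needs a rule specific to the data set, so it is not covered by this sketch.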

###  Empty cells

-> Empty cells can potentially give you a wrong result when you analyze data.
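-> Two common ways to deal with empty cells are to drop the affected rows or to fill the gaps with a replacement value. The following is a minimal sketch on a made-up frame; filling with the column mean is just one possible choice, not a recommendation for every data set.

```python
import numpy as np
import pandas as pd

# Made-up frame with one empty cell in 'LSTAT'.
sample = pd.DataFrame({'LSTAT': [4.98, np.nan, 4.03],
                       'MEDV': [24.0, 21.6, 34.7]})

without_empty = sample.dropna()   # new frame without rows that contain NaN
filled = sample.fillna({'LSTAT': sample['LSTAT'].mean()})  # fill 'LSTAT' gaps with its mean

print(without_empty)
print(filled)
```

-> Both calls return a new DataFrame and leave sample unchanged, so you can compare the results before deciding which to keep.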

