## Table of Contents

1. **[Pandas](#pandas)**

2. **[Data Structures](#structures)**
    
3. **[Pandas Series](#series)**
    - 3.1 - [Creating a Series](#creatingS)
    - 3.2 - [Manipulating Series](#manipulatingS)

4. **[Pandas DataFrames](#dataframes)**
    - 4.1 - [Creating DataFrames](#creatingDF)
    - 4.2 - [Manipulating DataFrames](#manipulatingDF)

5. **[Reading Data from Different Sources](#reading_data)**


<a id="pandas"> </a>
# 1. Pandas

Pandas provides data structures and data manipulation tools designed for data cleaning and analysis.
<br><br>
While pandas adopts many coding idioms from NumPy, the difference is that pandas is designed for tabular, heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.<br><br>
The name pandas is derived from the term 'panel data' (an econometrics term for multidimensional structured data sets).
                

**How to install pandas?**<br>
1. Install it with pip (in a notebook cell, prefix the command with `!`):<br>
`!pip install pandas`<br>
2. Import it under the conventional alias `pd`:<br>
`import pandas as pd`

Import the pandas library; the following convention is used:

In [2]:
import pandas as pd
import numpy as np

So from now on, we will use `pd` instead of pandas. 

<a id="structures"> </a>
# 2. Data Structures

Pandas has two main data structures:<br>
1. A Series is a 1-dimensional labelled array that can hold data of any type (such as integers, strings, booleans, floats, or Python objects). Its axis labels are collectively called the index.<br>
2. A DataFrame is a 2-dimensional labelled data structure with rows and columns. Each column can have a different data type.

<a id="series"> </a>
# 3. Pandas Series

A pandas Series is a one-dimensional labelled array capable of holding any data type. In practice, a Series usually holds values of a single type, much like an array, a list, or a column in a table. <br><br>Each item in a `pd.Series` is assigned an index label. By default, the labels run from 0 to N, where N is the length of the Series minus one.
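
As a small illustration of the default index described above, here is a minimal sketch. The values and the variable names (`colors`, `default_labels`, `last_label`) are made up for this example.

```python
import pandas as pd

# a Series built without an explicit index gets the labels 0, 1, ..., N
colors = pd.Series(["red", "green", "blue"])

# collect the index labels and the last label
default_labels = list(colors.index)
last_label = colors.index[-1]

print(colors)
print(default_labels, last_label)
```

The last label equals the length of the Series minus one.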

<a id="creatingS"> </a>
## 3.1 Creating a Series

**1. To create a numeric series** 

In [4]:
# create a numeric series
numbers = range(1,100,5)

# convert it to a series
pd.Series(numbers)

0      1
1      6
2     11
3     16
4     21
5     26
6     31
7     36
8     41
9     46
10    51
11    56
12    61
13    66
14    71
15    76
16    81
17    86
18    91
19    96
dtype: int64

The output also gives the data type of the series as `int64`.

Furthermore, note the default index labels, which run from 0 to N, where N is the length of the Series minus one.

<b>In pandas, the row labels are collectively called the index.</b>

**2. To create an object series** 

In [3]:
# create an object series
# the comma-separated strings form a tuple
string = "Hi", "How", "are", "you", "?"


# create a series from the above tuple
pd.Series(string)

0     Hi
1    How
2    are
3    you
4      ?
dtype: object

The output gives the data type of the series as `object`

**3. To create a series by giving both numeric and string values** 

In [4]:
# create a Series with an arbitrary list
pd.Series([345, 'London', 34.5, -34.45, 'Happy Birthday'])

0               345
1            London
2              34.5
3            -34.45
4    Happy Birthday
dtype: object

Because the values have mixed types, the Series is stored with the data type `object`; the numeric values are held as generic Python objects.
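
A quick way to see what an `object` Series holds is to inspect the type of each element. This is a minimal sketch; the names `mixed` and `element_types` are illustrative only.

```python
import pandas as pd

# a Series mixing an integer, a string and a float
mixed = pd.Series([345, 'London', 34.5])

# the Series as a whole has dtype object,
# but each element keeps its own Python type
element_types = [type(v).__name__ for v in mixed]

print(mixed.dtype)
print(element_types)
```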

**4. To set index values for a series**

In [5]:
# declare a list of marks
marks = [60, 89, 74, 86]

# declare a list of subjects
subject = ["Maths", "Science", "English" , "Social Science"]

# create a series from the above list of marks with subjects as its index names
# index: adds the index 
marks_series = pd.Series(marks, index = subject) 
marks_series

Maths             60
Science           89
English           74
Social Science    86
dtype: int64

The index is added using the argument `index=`. The data type of the series continues to be numeric.

**5. To print the values and index of the Series**

In [6]:
# print the index of the series
marks_series.index

Index(['Maths', 'Science', 'English', 'Social Science'], dtype='object')

In [7]:
# prints the values of the series
marks_series.values

array([60, 89, 74, 86])

**6. To create a series from a dictionary**

In [8]:
# declare a dictionary
data = {'Maths': 60, 'Science': 89, 'English': 76, 'Social Science': 86}

# create a series from the dictionary
# the dictionary keys are the index names
# the dictionary values are the series values
pd.Series(data)

Maths             60
Science           89
English           76
Social Science    86
dtype: int64

When a `dict` is passed, the index of the resulting Series contains the dict’s keys in the order given.
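
An explicit `index` can also be combined with a dictionary. The sketch below assumes a reduced version of the marks dictionary and a made-up label list `wanted`; labels missing from the dictionary come out as `NaN`.

```python
import pandas as pd

data = {'Maths': 60, 'Science': 89, 'English': 76}

# when an index is passed together with a dict, values are looked up by key;
# labels that are absent from the dict become NaN
wanted = ['English', 'Maths', 'Art']
subset = pd.Series(data, index=wanted)

print(subset)
```

Because `NaN` is a floating-point value, the resulting Series has a float data type even though the dictionary held integers.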

**7. To create a series using library `numpy`**

In [9]:
# import the numpy library
import numpy as np

# declare the variable 'sequence' using linspace()
sequence = np.linspace(0,10, 5)

# print the array 'sequence'
sequence

array([ 0. ,  2.5,  5. ,  7.5, 10. ])

We see that we obtain an array. We can now convert it to a Series.

In [10]:
# create a series from the above sequence
pd.Series(sequence)

0     0.0
1     2.5
2     5.0
3     7.5
4    10.0
dtype: float64

# -------------------------------------------------------------------------------------------
# Creating a series

In [11]:
list1=[80,90,100,110]
arr1=np.array([10,11,12,13])
dict1={'a':1,'b':2,'c':3}

s1=pd.Series(list1)
s2=pd.Series(list1,index=['row1','row2','row3','row4'])
s3=pd.Series(arr1,index=['row1','row2','row3','row4'])
s4=pd.Series(dict1)                                    # keys become the row labels
s4

a    1
b    2
c    3
dtype: int64

In [12]:
# Get index :

print(s2.index)
print(s2.values)

Index(['row1', 'row2', 'row3', 'row4'], dtype='object')
[ 80  90 100 110]


In [13]:
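# Indexing:
# positional access through plain [] on a labelled index is deprecated in recent pandas, so .iloc is used
print(s2.iloc[0])      # by integer position
print(s3.iloc[1])      # by integer position
print(s2['row1'])      # by label
print(s3['row1'])      # by label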

80
11
80
10


In [14]:
# Slicing:
print(s2[0:2],'\n')     # positional slice: positions 0 & 1 only (end position excluded)
print(s2['row1':'row3'])# label slice: 'row1' through 'row3' (end label included)

row1    80
row2    90
dtype: int64 

row1     80
row2     90
row3    100
dtype: int64


In [15]:
# Selecting several elements with a list:
print(s2.iloc[[0,2]],'\n')    # by integer position
print(s3[['row1','row3']])    # by label

row1     80
row3    100
dtype: int64 

row1    10
row3    12
dtype: int64


<a id="manipulatingS"> </a>
## 3.2 Manipulating Series

**1. To find the subjects in which the marks are more than 75**

In [16]:
# check for marks more than 75
marks_series[marks_series > 75]

Science           89
Social Science    86
dtype: int64

**2. To assign 68 marks to 'Art and Craft'**

In [17]:
# assign 68 marks to 'Art and Craft'
marks_series["Art and Craft"] = 68

In [18]:
# print the series
marks_series

Maths             60
Science           89
English           74
Social Science    86
Art and Craft     68
dtype: int64

**3. To check whether Maths marks are 73**

In [19]:
# check whether Maths marks are 73
marks_series.Maths == 73

False

In [20]:
# or you may use
marks_series["Maths"] == 73

False

**4. To create a series by generating numpy random numbers**

In [6]:
# create the numbers series 
# generate a sequence of 15 random numbers using random()
# round(): rounds off the number to nearest integer
num = pd.Series(np.random.random(15)*10).round()
num

0      2.0
1      9.0
2      9.0
3      9.0
4      0.0
5     10.0
6      3.0
7      3.0
8      5.0
9      8.0
10    10.0
11     7.0
12     8.0
13     8.0
14     0.0
dtype: float64

**5. To find the square of the numbers series**

In [22]:
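# declare a variable 'square'
# 'square' contains the square of each number in 'num'
square = pd.Series(num*num)
# use the original numbers as the index
# note: wrapping 'num' in a list creates a one-level MultiIndex
square.index = [num]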
square

6.0      36.0
8.0      64.0
9.0      81.0
6.0      36.0
10.0    100.0
9.0      81.0
7.0      49.0
8.0      64.0
3.0       9.0
0.0       0.0
4.0      16.0
3.0       9.0
4.0      16.0
7.0      49.0
9.0      81.0
dtype: float64

**6. To assign index name and object name**

In [23]:
# assign object name
square.name = 'Square'

# assign index name
square.index.name = 'Number'

# print the series
square

6.0      36.0
8.0      64.0
9.0      81.0
6.0      36.0
10.0    100.0
9.0      81.0
7.0      49.0
8.0      64.0
3.0       9.0
0.0       0.0
4.0      16.0
3.0       9.0
4.0      16.0
7.0      49.0
9.0      81.0
Name: Square, dtype: float64

**Note:** A Series’s index can be altered in place by assignment.

From the output, it is not clear whether the index is named. To check, let us print it.

We use `Series.index` to print the index.

In [24]:
# print the index
square.index

MultiIndex([( 6.0,),
            ( 8.0,),
            ( 9.0,),
            ( 6.0,),
            (10.0,),
            ( 9.0,),
            ( 7.0,),
            ( 8.0,),
            ( 3.0,),
            ( 0.0,),
            ( 4.0,),
            ( 3.0,),
            ( 4.0,),
            ( 7.0,),
            ( 9.0,)],
           name='Number')

From `name='Number'`, we can see that the index is named.

**7. Add a number 5 to every element of the series**

In [25]:
# add 5 to each element
square + 5

6.0      41.0
8.0      69.0
9.0      86.0
6.0      41.0
10.0    105.0
9.0      86.0
7.0      54.0
8.0      69.0
3.0      14.0
0.0       5.0
4.0      21.0
3.0      14.0
4.0      21.0
7.0      54.0
9.0      86.0
Name: Square, dtype: float64

The number 5 has been added to every element of the series.

**8. To extract a value specifying the index**

In [26]:
# obtain the value having index 7
square[7]

7.0    49.0
7.0    49.0
Name: Square, dtype: float64

All values with the index label 7 are returned.

**9. To extract a range of values specifying the location**

In [27]:
# extract values by position
# obtain the values from position 3 up to position 6
# positions start from 0
# position 7 is not included
square[3:7]

6.0      36.0
10.0    100.0
9.0      81.0
7.0      49.0
Name: Square, dtype: float64

**10. Usage of `.iloc`**

We use `.iloc` to select values by their integer position, regardless of the index labels.

In [28]:
# obtain the value in the 5th position
square.iloc[5]

81.0

In [29]:
# obtain the values from position 3 up to position 8
# positions start from 0
# position 9 is not included
square.iloc[3:9]

6.0      36.0
10.0    100.0
9.0      81.0
7.0      49.0
8.0      64.0
3.0       9.0
Name: Square, dtype: float64
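
The difference between label-based and position-based access is easiest to see when the labels are themselves integers. The following is a minimal sketch on a made-up Series `s` whose labels run in the opposite direction to the positions.

```python
import pandas as pd

# the labels 3, 2, 1, 0 deliberately differ from the positions 0, 1, 2, 3
s = pd.Series([10, 20, 30, 40], index=[3, 2, 1, 0])

by_label = s.loc[0]        # the element whose label is 0
by_position = s.iloc[0]    # the first element

label_slice = s.loc[3:1]       # labels 3, 2 and 1 (the end label is included)
position_slice = s.iloc[0:2]   # positions 0 and 1 (the end position is excluded)

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```

Using `.loc` and `.iloc` explicitly avoids any ambiguity about whether a number is meant as a label or as a position.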

**11. Sorting a numeric series**

In [30]:
# create a pandas series
age = pd.Series([23, 45, np.nan, 41, 23, 34, 55, np.nan, 34, 20])

# print values
age

0    23.0
1    45.0
2     NaN
3    41.0
4    23.0
5    34.0
6    55.0
7     NaN
8    34.0
9    20.0
dtype: float64

In [31]:
# ascending order
# sort_values: sorts the values
# ascending : if specified as True, it sorts values in ascending order (default value is True)
age.sort_values(ascending = True)

9    20.0
0    23.0
4    23.0
5    34.0
8    34.0
3    41.0
1    45.0
6    55.0
2     NaN
7     NaN
dtype: float64

In [32]:
# arrange in descending order
# sort_values: sorts the values
# ascending : if specified as True, it sorts values in ascending order (default value is True)
# set ascending to False to sort the values in descending order
age.sort_values(ascending = False)

6    55.0
1    45.0
3    41.0
5    34.0
8    34.0
0    23.0
4    23.0
9    20.0
2     NaN
7     NaN
dtype: float64

**12. Sorting a categorical series**

In [33]:
# create a pandas series
string_values = pd.Series(["a", "j", "d", "f", "t", "a"])

# print the values
string_values

0    a
1    j
2    d
3    f
4    t
5    a
dtype: object

In [34]:
# ascending order
# sort_values: sorts the values
# ascending : if specified as True, it sorts values in ascending order (default value is True)
string_values.sort_values(ascending = True)

0    a
5    a
2    d
3    f
1    j
4    t
dtype: object

In [35]:
# descending order
# sort_values: sorts the values
# ascending : if specified as True, it sorts values in ascending order (default value is True)
# set ascending to False to sort the values in descending order
string_values.sort_values(ascending = False)

4    t
1    j
3    f
2    d
0    a
5    a
dtype: object

**13. Sorting based on index**

In [36]:
# recall the series 'square'
square

6.0      36.0
8.0      64.0
9.0      81.0
6.0      36.0
10.0    100.0
9.0      81.0
7.0      49.0
8.0      64.0
3.0       9.0
0.0       0.0
4.0      16.0
3.0       9.0
4.0      16.0
7.0      49.0
9.0      81.0
Name: Square, dtype: float64

In [37]:
# sort in ascending order based on index
# sort_index: sorts the series based on the index
# ascending : if specified as True, it sorts values in ascending order (default value is True)
square.sort_index(ascending = True)

0.0       0.0
3.0       9.0
3.0       9.0
4.0      16.0
4.0      16.0
6.0      36.0
6.0      36.0
7.0      49.0
7.0      49.0
8.0      64.0
8.0      64.0
9.0      81.0
9.0      81.0
9.0      81.0
10.0    100.0
Name: Square, dtype: float64

In [38]:
# sort in descending order based on index
# sort_index: sorts the series based on the index
# ascending : if specified as True, it sorts the index in ascending order (default value is True)
# set ascending to False to sort the index in descending order
square.sort_index(ascending = False)

10.0    100.0
9.0      81.0
9.0      81.0
9.0      81.0
8.0      64.0
8.0      64.0
7.0      49.0
7.0      49.0
6.0      36.0
6.0      36.0
4.0      16.0
4.0      16.0
3.0       9.0
3.0       9.0
0.0       0.0
Name: Square, dtype: float64

**14. Rank a Series**

In [39]:
# recall the marks_series
marks_series

Maths             60
Science           89
English           74
Social Science    86
Art and Craft     68
dtype: int64

In [40]:
# rank the marks in ascending order
# rank(): ranks the values of a series 
marks_series.rank()

Maths             1.0
Science           5.0
English           3.0
Social Science    4.0
Art and Craft     2.0
dtype: float64

In [41]:
# rank the marks in descending order
# rank(): ranks the values of a series 
# ascending : if specified as True, it ranks values in ascending order (default value is True)
# set ascending to False so that the highest value gets rank 1
marks_series.rank(ascending = False)

Maths             5.0
Science           1.0
English           3.0
Social Science    2.0
Art and Craft     4.0
dtype: float64

In [42]:
# USER :


list1=[20,30,10,30,40,50,40]
name=['A','B','C','D','E','F','G']
s1=pd.Series(list1,index=name)

In [43]:
s1.rank()                                 ## The default method is 'average' | Ranking is ascending

A    2.0
B    3.5
C    1.0
D    3.5
E    5.5
F    7.0
G    5.5
dtype: float64

In [44]:
s1.rank(method='min',ascending=True)     ## Ranking is ascending | B and D both get rank 3 as both have the value 30,
                                         #  and rank 4 is skipped as the method is 'min'.

A    2.0
B    3.0
C    1.0
D    3.0
E    5.0
F    7.0
G    5.0
dtype: float64

In [45]:
s1.rank(method='min',ascending=False)   ## Ranking is descending 

A    6.0
B    4.0
C    7.0
D    4.0
E    2.0
F    1.0
G    2.0
dtype: float64

In [46]:
s1.rank(method='max')                   ## B and D both get rank 4 as both have the value 30, and rank 3 is skipped as 
                                        # the method is 'max'.

A    2.0
B    4.0
C    1.0
D    4.0
E    6.0
F    7.0
G    6.0
dtype: float64

In [47]:
s1.rank(method='first')                # Equal values get different ranks, in order of appearance.

A    2.0
B    3.0
C    1.0
D    4.0
E    5.0
F    7.0
G    6.0
dtype: float64

In [48]:
s1.rank(method='dense')                 # Equal values share a rank and no rank is skipped afterwards.

A    2.0
B    3.0
C    1.0
D    3.0
E    4.0
F    5.0
G    4.0
dtype: float64
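
To compare the tie-breaking methods side by side, the ranks can be collected into one table. This is a sketch on a shortened, made-up Series `scores`; `comparison` is an illustrative name.

```python
import pandas as pd

# B and D are tied on the value 30
scores = pd.Series([20, 30, 10, 30], index=['A', 'B', 'C', 'D'])

# one column per ranking method
comparison = pd.DataFrame({
    method: scores.rank(method=method)
    for method in ['average', 'min', 'max', 'first', 'dense']
})

print(comparison)
```

Only the rows for the tied labels B and D differ between the columns; untied values receive the same rank under every method in this example.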

In [49]:
# Manipulating series:
list1=[10,20,np.nan,30,np.nan]         # np.nan marks a missing value (NaN).
s2=pd.Series(list1)
s2

0    10.0
1    20.0
2     NaN
3    30.0
4     NaN
dtype: float64

In [50]:
# Checking whether values are null or not:

print(s2.isnull(),'\n')        # True where the value is NaN
print(s2.notnull())            # False where the value is NaN

0    False
1    False
2     True
3    False
4     True
dtype: bool 

0     True
1     True
2    False
3     True
4    False
dtype: bool
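
The boolean results of `isnull()` and `notnull()` are mostly used for counting and filtering. A minimal sketch, with illustrative names `n_missing` and `present`:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, np.nan, 30, np.nan])

# True counts as 1, so the sum is the number of missing values
n_missing = s.isnull().sum()

# a boolean mask keeps only the non-missing entries
present = s[s.notnull()]

print(n_missing)
print(present)
```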


<a id="dataframes"> </a>
# 4. Pandas DataFrames

A DataFrame is a tabular representation of data containing an ordered collection of columns, each of which can be a different type (such as numeric, string, boolean). <br><br>
The DataFrame has both a row and a column index; it can be thought of as a dict of Series all sharing the same index. In a data frame, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. <br><br>
While a DataFrame is physically two-dimensional, it can be used to represent higher dimensional data in a tabular format using hierarchical indexing.
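
The "dict of Series sharing the same index" view can be sketched as follows. The two Series and their values are made up for this example; the point is that columns are aligned on index labels rather than on position.

```python
import pandas as pd

# two Series with overlapping but different labels, in different orders
marks = pd.Series({'Maths': 45, 'History': 65})
gpa = pd.Series({'History': 3.0, 'Maths': 2.5, 'Art': 4.0})

# the columns are aligned on the index labels;
# a label missing from one Series gives NaN in that column
df = pd.DataFrame({'Marks': marks, 'GPA': gpa})

print(df)
print(type(df['Marks']))
```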
                   

<a id="creatingDF"> </a>
## 4.1 Creating DataFrames

**1. Creating a data frame from a dictionary**

In [51]:
# create a dictionary
data = {'Subject': ['Maths', 'History', 'Science', 'English', 'Georaphy', 'Art'],
        'Marks': (45, 65, 78, 65, 80, 78),
        'GPA': [2.5, 3.0, 3.5, 2.0, 4.0, 4.0]}

# create the dataframe using DataFrame()
df_marks = pd.DataFrame(data)

# print the dataframe
print(df_marks)

    Subject  Marks  GPA
0     Maths     45  2.5
1   History     65  3.0
2   Science     78  3.5
3   English     65  2.0
4  Georaphy     80  4.0
5       Art     78  4.0


**Note:** As with a Series, the resulting DataFrame is assigned an index automatically. Note also that the 'Marks' values were passed as a tuple; both lists and tuples are accepted.

**2. Another way to create a dataframe: from a list of dictionaries**

In [52]:
# create a list of dictionaries, one per row
data = [{'Subject': 'Maths', 'Marks': 45, 'GPA':2.5},
        {'Subject':'History', 'Marks':65, 'GPA':3.0},
        {'Subject':'Science', 'Marks':78, 'GPA':3.5},
        {'Subject':'English', 'Marks':65, 'GPA':2.0},
        {'Subject':'Georaphy', 'Marks':80, 'GPA':4.0},
        {'Subject':'Art', 'Marks':78, 'GPA':4.0}]

# create the dataframe using DataFrame()
df_marks = pd.DataFrame(data)

# print the dataframe
print(df_marks)

    Subject  Marks  GPA
0     Maths     45  2.5
1   History     65  3.0
2   Science     78  3.5
3   English     65  2.0
4  Georaphy     80  4.0
5       Art     78  4.0


**3. To create dataframe from lists**

In [53]:
# declare the list 'Subject'
Subject = ['Maths', 'History', 'Science', 'English', 'Georaphy', 'Art']

# declare the list 'Marks'
Marks = [45, 65, 78, 65, 80, 78]

# declare the list 'GPA'
GPA = [2.5, 3.0, 3.5, 2.0, 4.0, 4.0]

In [54]:
# create a DataFrame from the lists
# index: specifies the index names
df_marks = pd.DataFrame([Subject, Marks, GPA], index = ['Subject','Marks','GPA'])

# print the DataFrame
df_marks

             0        1        2        3         4    5
Subject  Maths  History  Science  English  Georaphy  Art
Marks       45       65       78       65        80   78
GPA        2.5      3.0      3.5      2.0       4.0  4.0


However, we want a vertical dataframe, with one column per variable, so we use `.T`. The 'T' stands for transpose.

In [55]:
# transpose the DataFrame
df_marks.T

    Subject Marks  GPA
0     Maths    45  2.5
1   History    65  3.0
2   Science    78  3.5
3   English    65  2.0
4  Georaphy    80  4.0
5       Art    78  4.0


**4. To create dataframe from series**

In [56]:
# declare the series 'Subject'
Subject = pd.Series(['Maths', 'History', 'Science', 'English', 'Georaphy', 'Art'])

# declare the series 'Marks'
Marks = pd.Series([45, 65, 78, 65, 80, 78])

# declare the series 'GPA'
GPA = pd.Series([2.5, 3.0, 3.5, 2.0, 4.0, 4.0])

In [57]:
# create a DataFrame from the Series
# index: specifies the index names
df_marks = pd.DataFrame([Subject, Marks, GPA], index = ['Subject','Marks','GPA'])

# print the DataFrame
df_marks

             0        1        2        3         4    5
Subject  Maths  History  Science  English  Georaphy  Art
Marks       45       65       78       65        80   78
GPA        2.5      3.0      3.5      2.0       4.0  4.0


Again, we want a vertical dataframe, with one column per variable, so we use `.T`. The 'T' stands for transpose.

In [58]:
# transpose the DataFrame
df_marks.T

    Subject Marks  GPA
0     Maths    45  2.5
1   History    65  3.0
2   Science    78  3.5
3   English    65  2.0
4  Georaphy    80  4.0
5       Art    78  4.0


**Remark:** `.T` returns a transposed copy and does not change the original data frame. Assign the result to a name if you want to keep it.

**5. To read data from csv file**

In [59]:
df_basic_info = pd.read_csv("basic_info.csv")

In [60]:
type(df_basic_info)

pandas.core.frame.DataFrame

On checking the data type, we notice it is read as pandas data frame.

In [61]:
print(df_basic_info)

    Age  Weight (in kg)  Height (in m)
0    45              60           1.35
1    12              43           1.21
2    54              78           1.50
3    26              65           1.21
4    68              50           1.32
5    21              43           1.52
6    10              32           1.65
7    57              34           1.61
8    75              23           1.24
9    32              21           1.52
10   23              53           1.50
11   34              65           1.76
12   55              89           1.65
13   23              45           1.75
14   56              76           1.69
15   67              78           1.85
16   26              65           1.21
17   56              74           1.69
18   67              78           1.85
19   26              65           1.21
20   68              50           1.32
21   56              76           1.69
22   67              78           1.85


**6. To print head of the data**

In [62]:
# display the first 5 rows of the DataFrame
# head(): displays the first 5 rows
# to display more rows, for example 15, use head(15)
# the default value is 5
df_basic_info.head()

   Age  Weight (in kg)  Height (in m)
0   45              60           1.35
1   12              43           1.21
2   54              78           1.50
3   26              65           1.21
4   68              50           1.32


By default, the `.head()` will display **first** five rows. However, we can set the desired number of rows to be displayed.

Say we want to see the first 9 rows, we write the number 9 in the parentheses.

In [63]:
# display 9 rows
df_basic_info.head(9)

   Age  Weight (in kg)  Height (in m)
0   45              60           1.35
1   12              43           1.21
2   54              78           1.50
3   26              65           1.21
4   68              50           1.32
5   21              43           1.52
6   10              32           1.65
7   57              34           1.61
8   75              23           1.24


**7. To print tail of the data**

In [64]:
# to display the last 5 rows
df_basic_info.tail()

    Age  Weight (in kg)  Height (in m)
18   67              78           1.85
19   26              65           1.21
20   68              50           1.32
21   56              76           1.69
22   67              78           1.85


By default, the `.tail()` will display **last** five rows. However, we can set the desired number of rows to be displayed.

Say we want to see the last 14 rows, we write the number 14 in the parentheses.

In [65]:
# to display the last 14 rows
df_basic_info.tail(14)

    Age  Weight (in kg)  Height (in m)
9    32              21           1.52
10   23              53           1.50
11   34              65           1.76
12   55              89           1.65
13   23              45           1.75
14   56              76           1.69
15   67              78           1.85
16   26              65           1.21
17   56              74           1.69
18   67              78           1.85
19   26              65           1.21
20   68              50           1.32
21   56              76           1.69
22   67              78           1.85


**8. To obtain the dimension of the data**

In [66]:
# to display the shape of the data
df_basic_info.shape

(23, 3)

**9. To know the data types of a data frame**

In [67]:
# to know the type of each variable
df_basic_info.dtypes

Age                 int64
Weight (in kg)      int64
Height (in m)     float64
dtype: object

We see the data type of each variable.

**10. To know some information of the data**

In [68]:
# to know information on the variables in the data
df_basic_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             23 non-null     int64  
 1   Weight (in kg)  23 non-null     int64  
 2   Height (in m)   23 non-null     float64
dtypes: float64(1), int64(2)
memory usage: 680.0 bytes


The output first gives the number of rows in the data: `RangeIndex: 23 entries, 0 to 22` means that there are 23 rows, numbered from 0 to 22. `Data columns (total 3 columns)` shows that there are three columns in total.

The line `Age 23 non-null int64` indicates that the column named 'Age' has 23 non-null observations of the data type 'int64'.

Finally, the memory used to store this dataframe is 680 bytes.

**11. To check the type of a column in the data frame**

In [69]:
# check the type of the column 'Age'
type(df_basic_info.Age)

pandas.core.series.Series

In [70]:
# check the type of each variable
type(df_basic_info["Weight (in kg)"])

pandas.core.series.Series

In [71]:
# check the type of each variable
type(df_basic_info["Height (in m)"])

pandas.core.series.Series

In [72]:
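# Describe: summary statistics of the numeric columns
# (the data frame read above is named df_basic_info)
df_basic_info.describe()

In [None]:
# include='all' also summarises non-numeric columns, if any
df_basic_info.describe(include='all')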

# -------------------------------------------------------------------------------------------
# USER :

In [None]:
# DataFrame from list in list :

list1=[["S1",20],["S2",80],["S3",70]]
data1=pd.DataFrame(list1,columns=["Name","Marks"])
data1

In [None]:
list1=[["S1",20],["S2",80],["S3",70]]
data1=pd.DataFrame(list1,columns=["Name","Marks"],index=["row1",'row2','row3'])
data1

In [None]:
Subject=['ABC','DEF','GHI']
Marks=[10,23,20]
CGPA=[3,5,4]
data1=pd.DataFrame([Subject,Marks,CGPA],index=['Subject','Marks','CGPA'])
print(data1,'\n')
print(data1.T)

In [None]:
# DataFrame from dictionary :

dict1={"Name":("S4","S5","S6"),
       "Marks":(45,56,44),
       "CGPA":(3,4,3)}
data2=pd.DataFrame(dict1)
data2

In [None]:
dict2=[{"Subject":"A","Marks":50,"CGPA":4},
       {"Subject":"B","Marks":60,"CGPA":3},
       {"Subject":"C","Marks":40,"CGPA":4},
       {"Subject":"D","Marks":55,"CGPA":5},
       {"Subject":"E","Marks":65,"CGPA":7}]
data3=pd.DataFrame(dict2)
print(data3)

In [None]:
# Reading files:

In [None]:
#pd.read_csv('file')
df1=pd.read_csv('basic_info.csv')
df1

In [None]:
# html:
# read_html returns a list of DataFrames, one per table found in the file
df2=pd.read_html('employee.html',header=1,index_col=0)
df2

In [None]:
#text file:
df3=pd.read_csv('demography.txt',delimiter='\t')
df3

In [None]:
# excel file:
# read_excel returns a DataFrame, so the result is named df4
df4=pd.read_excel('employee_info.xlsx',sheet_name=0)         # 0 - reads the first sheet
df4

In [None]:
# xml file:
# read_xml also returns a DataFrame
df5=pd.read_xml('student_data.xml')
df5

In [None]:
df11=pd.read_csv('basic_info.csv')
df11.to_json('jnew.json')
df11.to_html('hnew.html')
df11.to_excel('enew.xlsx')

In [None]:
# Seaborn Library : preinstalled datasets available for practice
import seaborn as sns
sns.get_dataset_names()

In [None]:
# get the "flights" dataset
fly=sns.load_dataset('flights')
print(fly)

**Note that every column of the data frame is a pandas Series.**

<a id="manipulatingDF"> </a>
## 4.2 Manipulating DataFrames

### Add new column and rows

 **Remark:**   
 1. DataFrame[column] works for any column name, but DataFrame.column only works when the column name is a valid Python variable name.<br>
 2. New columns cannot be created with the ` df_basic_info.BMI ` syntax.
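
Both remarks can be checked on a small frame. The sketch below uses a made-up two-row data frame `df`; the attribute assignment triggers a pandas warning, which is silenced here only to keep the output clean.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'Weight (in kg)': [60, 43], 'Height (in m)': [1.35, 1.21]})

# attribute assignment only sets a Python attribute on the object;
# pandas warns about it and no column is created
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.BMI = [1, 2]
attr_created_column = 'BMI' in df.columns

# bracket assignment creates the column; bracket access also works for
# names such as 'Weight (in kg)' that are not valid Python identifiers
df['BMI'] = df['Weight (in kg)'] / df['Height (in m)'] ** 2
bracket_created_column = 'BMI' in df.columns

print(attr_created_column, bracket_created_column)
```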
                   

**1. Adding a new column to the data set**

In [None]:
# create a new variable BMI
df_basic_info["BMI"] = df_basic_info["Weight (in kg)"] / df_basic_info["Height (in m)"]**2

In [None]:
# print the DataFrame
df_basic_info.head()

In [None]:
# check the shape of the DataFrame
df_basic_info.shape

**2. Adding a new row to the data set**

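A new row can be added using `.loc` with a new index label.

In [None]:
# add a new row at the end of the DataFrame
# the existing labels run from 0 to 22, so the next label is 23
# the values follow the column order: Age, Weight, Height, BMI
df_basic_info.loc[23] = [45, 85, 1.8, 26.3]

In [None]:
# display the DataFrame
df_basic_info

We see that a new row with the index label 23 has been added to the data.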

**3. Indexing a dataframe using `.iloc`**

`DataFrame.iloc[]` selects rows and columns by integer position (0, 1, 2, 3, …, n-1), regardless of the index labels. It is useful when the index labels are something other than the default integers, or when the user does not know the index labels.

We shall work on the BMI data set.

#### Select the row at position 2

In [None]:
# select the row at position 2 (the third row, since positions start at 0)
df_basic_info.iloc[2]

In [None]:
df_basic_info.iloc[0]

#### Select the rows at positions 4, 7 and 10

In [None]:
# select the rows at positions 4, 7 and 10 (positions start at 0)
df_basic_info.iloc[[4,7,10]]

We use two pairs of square brackets since we are passing a list of row positions to be accessed.

#### Select the rows at positions 12 to 17

In [None]:
# select the rows at positions 12 to 17 (position 18 is not included)
df_basic_info.iloc[12:18]

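#### Select the column at position 1

In [None]:
# select the column at position 1 (the second column, since positions start at 0)
# the colon indicates that all the rows are selected
df_basic_info.iloc[:, 1]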

#### Select the last column

In [None]:
# select the last column
# the colon indicates that all the rows are selected
# -1 indicates that the last column is selected
df_basic_info.iloc[:,-1]

To select the last column we use -1, to select the second last column we use -2

#### Select the first two columns

In [None]:
# select the 1st and 2nd columns
# the colon indicates that all the rows are selected
# 0:2 indicates that the 1st and 2nd columns are selected
df_basic_info.iloc[:,0:2]

#### Select the first two columns and the rows at positions 5 to 10

In [None]:
# 5:11 indicates that the rows at positions 5 to 10 will be selected
# 0:2 indicates that the 1st and 2nd columns will be selected
df_basic_info.iloc[5:11, 0:2]

**4. Indexing a dataframe using `.loc`**

`DataFrame.loc[]` selects by index label. It returns the matching row or data frame if the label exists, and raises a `KeyError` if it does not. <br>
`DataFrame.loc[row_labels, column_labels]` is used to select rows and columns based on their names.

#### Select rows 1 to 5 and the 2nd and 4th columns

In [None]:
# 1:5 indicates that rows with labels 1, 2, 3, 4 and 5 are selected (with .loc the end label is included)
# ["Weight (in kg)","BMI"] indicates that the specified columns are selected
df_basic_info.loc[1:5,["Weight (in kg)","BMI"]]

**Note:** Here the row labels are integers, so `1:5` is interpreted as a range of labels, not of positions.
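
When the row labels are not integers, the contrast between `.loc` and `.iloc` becomes more visible. A minimal sketch on a made-up frame `df` with string labels:

```python
import pandas as pd

df = pd.DataFrame({'Age': [45, 12, 54, 26], 'Weight': [60, 43, 78, 65]},
                  index=['a', 'b', 'c', 'd'])

# label-based: rows 'b' through 'c', the end label is included
by_label = df.loc['b':'c', ['Age']]

# position-based: only the row at position 1, the end position is excluded
by_position = df.iloc[1:2, [0]]

print(by_label)
print(by_position)
```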

**5. Selecting columns by specifying column names**

#### Select the column 'Age'

In [None]:
# select the column 'Age'
df_basic_info.Age

**Remark:** With attribute access, we can select only one column at a time.

In [None]:
# OR
df_basic_info["Age"]

#### Select the column 'Age' and 'BMI'

In [None]:
# select two columns
# the column names are passed in a list
df_basic_info[["Age","BMI"]]

**6. Conditional subsetting**

Selecting rows where the value of `Age` is greater than 47 

In [None]:
# to select rows where the Age is greater than 47
df_basic_info[df_basic_info['Age'] > 47]

Selecting rows where the age is more than 40 *or* the weight is more than 65

In [None]:
# to select rows where either age is greater than 40 or weight is more than 65
df_basic_info[ (df_basic_info["Age"] > 40) | (df_basic_info['Weight (in kg)'] > 65)]

Selecting rows where the age *and* the weight are both more than 65

In [None]:
# select rows where both age and weight are more than 65
df_basic_info[(df_basic_info["Age"] > 65) & (df_basic_info['Weight (in kg)'] > 65)]
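
The conditions above are boolean Series, so they can be stored in variables and combined with `|` (or), `&` (and) and `~` (not). The sketch below uses a made-up four-row frame; when the comparisons are written inline, each one needs its own parentheses because `|` and `&` bind more tightly than `>`.

```python
import pandas as pd

df = pd.DataFrame({'Age': [45, 12, 70, 26], 'Weight': [60, 43, 78, 80]})

# each condition is a boolean Series with one entry per row
older = df['Age'] > 40
heavier = df['Weight'] > 65

either = df[older | heavier]       # at least one condition holds
both = df[older & heavier]         # both conditions hold
neither = df[~(older | heavier)]   # no condition holds

print(either)
print(both)
print(neither)
```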

**7. Sort the data frame on the basis of values in a column**

Each column of a pandas DataFrame is treated as a pandas Series. The `.sort_values()` method of a DataFrame works similarly to `pandas.Series.sort_values()`, except that the column (or columns) to sort by must be specified.


In [None]:
# sort the data frame on basis of 'Age' values
# by default the values will get sorted in ascending order
df_basic_info.sort_values('Age')
# Note: 'ascending = False' will sort the data frame in descending order

In [None]:
df_basic_info.sort_values(by=['Age','Weight (in kg)'],ascending=[True,False])
# Age in ascending and Weight is in descending.

In [None]:
df_basic_info.sort_index()

**8. Rank the dataframe**

In [None]:
# rank the rows in descending order based on 'BMI'
# method = 'min' gives tied 'BMI' values the lowest rank in their group
df_basic_info['Rank_min'] = df_basic_info['BMI'].rank(ascending = False, method  = 'min')

# print the data
df_basic_info

In [None]:
df_basic_info["Rank on age"]=df_basic_info["Age"].rank(method='min')
df_basic_info

From the above data frame, we can see that 'BMI = 44.395875' occurs three times; thus, in the `Rank_min` column, method = 'min' assigns the minimum rank (=1) to all three values of BMI. The rank '4' is assigned to the second largest value of BMI and so on. Thus, there is no rank equal to 2 or 3.

In [None]:
# method = 'dense' assigns same rank to all the same BMI values
df_basic_info['Rank_densed'] = df_basic_info['BMI'].rank(method = 'dense')

# print the data
df_basic_info

Here, the dense method assigns the minimum rank (=1) to the minimum value (=9.089335) of the BMI. Rank 2 is assigned to the next smallest value and so on. The value 44.395875, which occurs three times, receives the same rank (18) for all three observations.
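
The difference between the two methods is easier to follow on a few rows. The sketch below uses made-up BMI values with one tie and ranks them in descending order under both methods.

```python
import pandas as pd

# the value 30.0 appears twice
df = pd.DataFrame({'BMI': [30.0, 25.0, 30.0, 20.0]})

# 'min': tied values share the lowest rank, and the following ranks are skipped
df['Rank_min'] = df['BMI'].rank(ascending=False, method='min')

# 'dense': tied values share a rank, and no rank is skipped
df['Rank_dense'] = df['BMI'].rank(ascending=False, method='dense')

print(df)
```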

<a id="reading_data"> </a>
# 5. Reading Data from Different Sources

**1. Read a `.xlsx` file**

In [None]:
# to read an xlsx file
pd.read_excel('employee_info.xlsx')

**2. Read a `.zip` file**

The zipped file contains a .csv file.

In [None]:
# import the library zipfile
import zipfile

# Zipfile reads the zipped file
# from the zipped file open the csv as f
# read the csv file
with zipfile.ZipFile('data.zip') as z:
    with z.open('example.csv') as f:
        file = pd.read_csv(f)
        print(file.head())
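
The same pattern can be tried without any file on disk. This sketch first builds a small archive in memory; the file name `example.csv` and its contents are made up for the example.

```python
import io
import zipfile
import pandas as pd

# build a small zip archive in memory so that no file on disk is needed
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode='w') as z:
    z.writestr('example.csv', 'Age,Weight\n45,60\n12,43\n')

# rewind the buffer, open the archive and read the csv inside it
buffer.seek(0)
with zipfile.ZipFile(buffer) as z:
    with z.open('example.csv') as f:
        df_zip = pd.read_csv(f)

print(df_zip)
```

If the archive holds a single csv file, `pd.read_csv` can usually read the `.zip` path directly, since the compression is inferred from the file extension.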

**3. Read a `.html` file**

In [None]:
# read the html using read_html()
# read_html() returns a list of DataFrames, one for each table in the file
# header = 1 indicates that the row at position 1 (the second row) contains the headings
# index_col = 0 indicates that the first column contains the index
df_emp = pd.read_html('employee.html', header=1, index_col=0)

# print the data
df_emp

**4. Read a `.txt` file**

In [None]:
# read the text file
# sep is the delimiter used in the file (here, a tab)
df_demography = pd.read_csv('demography.txt', sep="\t")

# print the head of the data
df_demography.head()
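
The effect of `sep` can be checked on a small piece of tab-separated text. In this sketch the text and its column names are made up, and `io.StringIO` stands in for the file.

```python
import io
import pandas as pd

# two columns separated by tabs, one record per line
text = "Name\tAge\nAsha\t31\nRavi\t27\n"

# StringIO lets read_csv treat the string like an open text file
df_tab = pd.read_csv(io.StringIO(text), sep="\t")

print(df_tab)
```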

**5. Read a `.json` file**

In [None]:
# read the file
pd.read_json('iris.json')
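
The layout of `iris.json` is not shown here, so the following sketch assumes one common layout: a JSON array with one object per row. The records themselves are made up.

```python
import io
import pandas as pd

# assumed layout: a list of records, one JSON object per row
records = ('[{"species": "setosa", "petal_length": 1.4}, '
           '{"species": "virginica", "petal_length": 5.1}]')

# StringIO lets read_json treat the string like an open file
df_json = pd.read_json(io.StringIO(records))

print(df_json)
```

For other layouts, the `orient` parameter of `pd.read_json` describes how the JSON is structured.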

**6. Read a `.xml` file**

In [None]:
# import the library used to parse the xml file
import xml.etree.ElementTree as ET 

# extract the xml file
tree = ET.parse("student_data.xml")
root = tree.getroot() 

# assign column names of the output DataFrame
df_col = ["Name", "Gender", "Marks"]

# create an empty list 
rows = []

for node in root: 
    name = node.attrib.get("name")
    # findtext() returns None when the child element is missing
    gender = node.findtext("gender")
    marks = node.findtext("marks")
    
    # append each observation in the data to 'rows'
    rows.append({"Name": name, "Gender": gender, 
                 "Marks": marks})

# create a DataFrame    
df_student = pd.DataFrame(rows, columns = df_col)

# print the DataFrame
df_student
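
The structure of `student_data.xml` is not shown, so the sketch below assumes a layout that matches the loop above: one element per student with a `name` attribute and `gender` and `marks` child elements. The inline XML is made up, and the second student deliberately has no `marks` element.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# assumed layout, parsed from a string instead of a file
xml_text = """<data>
  <student name="Asha"><gender>F</gender><marks>81</marks></student>
  <student name="Ravi"><gender>M</gender></student>
</data>"""
root = ET.fromstring(xml_text)

rows = []
for node in root:
    rows.append({
        "Name": node.attrib.get("name"),
        # findtext() returns None if the child element is absent
        "Gender": node.findtext("gender"),
        "Marks": node.findtext("marks"),
    })

df_xml = pd.DataFrame(rows, columns=["Name", "Gender", "Marks"])
print(df_xml)
```

Values read this way are strings, so a numeric column such as 'Marks' would still need converting, for example with `pd.to_numeric`.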

# Direct Access:
    df1[col_label_only]    or df1.col_label
    df1[[list of col_label_only]]
    df1[row_labels_slice_only]

In [None]:
######. df1[row_labels_slice_only]
# df[0]---- raises a KeyError unless a column is labelled 0
# df[0:8]--- displays the rows at positions 0 to 7

# .iloc: 
integer-based access.

In [None]:
#To Check the dataset :
df23=pd.read_csv("basic_info_missingdata.csv")
df23

In [None]:
df23.isnull().sum()      ## counts the null values present in each column

In [None]:
df23.fillna(value=10)   ## fills all the NaN values with 10 (returns a new DataFrame)

In [None]:
df23.ffill()   # forward fill: each NaN takes the last valid value before it (fillna(method='ffill') is deprecated)

In [None]:
df23.bfill() # backward fill: each NaN takes the next valid value after it (fillna(method='bfill') is deprecated)

In [None]:
# Fill NaN with the mean value of Age:
# note: as written, this fills the NaN values of every column with the mean of 'Age'
meann=df23["Age"].mean()
df23.fillna(value=meann)


In [None]:
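# Drop the rows that contain NaN values:
df23.dropna()

In [None]:
# how="any" (the default) drops a row if any of its values is NaN
df23.dropna(how="any")

In [None]:
# how="all" drops a row only if all of its values are NaN
df23.dropna(how="all")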
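
The difference between `how="any"` and `how="all"`, and a way to fill only one column, can be sketched on a small made-up frame. The contents of `basic_info_missingdata.csv` are not shown, so the values below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [45, np.nan, np.nan], 'Weight': [60, 43, np.nan]})

# how='any' keeps only the rows without any NaN
any_dropped = df.dropna(how='any')

# how='all' drops only the rows in which every value is NaN
all_dropped = df.dropna(how='all')

# a dict restricts the fill to the named column: only 'Age' gets its mean
filled = df.fillna({'Age': df['Age'].mean()})

print(any_dropped)
print(all_dropped)
print(filled)
```

None of these calls changes `df` itself; each returns a new DataFrame.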