# **Topic 03 Part 1: DataFrame Functions**

Pandas DataFrame.append(), apply(), aggregate(), assign(), astype(), count(), cut(), describe(), drop_duplicates(), groupby()

# **1.1:Pandas DataFrame.append()**

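The append() method of a pandas DataFrame appends the rows of another DataFrame, Series, or dict-like object to the end of the caller and returns a new DataFrame. It combines objects vertically only; it does not add columns. DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the examples below use pd.concat(), which produces the same result.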

**The syntax for using the append() function (pandas versions before 2.0) is as follows:**

DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=False)

**Let's break down the different parameters:**

**other:** Specifies the DataFrame, Series, or dict-like object (or a list of these) whose rows are appended to the original DataFrame.

**ignore_index:** Determines whether to discard the existing index labels and number the rows of the result 0, 1, ..., n-1. By default, it is set to False, meaning the original index values are preserved.

**verify_integrity:** Determines whether to check for duplicate index labels when appending. If set to True and duplicates are found, a ValueError is raised. By default, it is set to False.

**sort:** Determines whether to sort the columns of the result when the columns of the two objects are not aligned. By default, it is set to False in pandas 1.x, meaning no sorting is performed.
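Since the examples in this section rely on pd.concat(), the following minimal sketch shows how the ignore_index and verify_integrity options behave there. The frames `left` and `right` are hypothetical sample data made up for illustration.

```python
import pandas as pd

# Hypothetical sample frames whose index labels overlap on "b"
left = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"x": [3, 4]}, index=["b", "c"])

# Default: the original index labels are kept, so "b" appears twice
kept = pd.concat([left, right])

# ignore_index=True discards the labels and renumbers the rows from 0
renumbered = pd.concat([left, right], ignore_index=True)

# verify_integrity=True raises ValueError when index labels overlap
try:
    pd.concat([left, right], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

print(kept.index.tolist())        # ['a', 'b', 'b', 'c']
print(renumbered.index.tolist())  # [0, 1, 2, 3]
print(overlap_detected)           # True
```

Renumbering with ignore_index=True is usually the simplest choice when the index labels carry no meaning.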

**Example 1:**

In [2]:
import pandas as pd

# Create first DataFrame
info1 = {'x': [23, 45, 21, 87], 'y': [55, 77, 33, 88]}
df1 = pd.DataFrame(info1, index=['a', 'b', 'c', 'd'])
print("Create first DataFrame using dictionary")
print(df1)

# Add column 'z' to df1
df1['z'] = [5, 6, 7, 8]
print(df1)

# Create second DataFrame
info2 = {'x': [43, 55, 3, 4], 'y': [22, 34, 22, 13]}
df2 = pd.DataFrame(info2)

# Add column 'z' to df2
df2['z'] = [54, 6, 4, 3]
print(df2)

append1 = pd.concat([df1, df2], ignore_index=True)
print(append1)


Create first DataFrame using dictionary
    x   y
a  23  55
b  45  77
c  21  33
d  87  88
    x   y  z
a  23  55  5
b  45  77  6
c  21  33  7
d  87  88  8
    x   y   z
0  43  22  54
1  55  34   6
2   3  22   4
3   4  13   3
    x   y   z
0  23  55   5
1  45  77   6
2  21  33   7
3  87  88   8
4  43  22  54
5  55  34   6
6   3  22   4
7   4  13   3


In [4]:
import pandas as pd

print("First DataFrame:")
data1 = {
    'name': ['a', 'b', 'c', 'd', 'e', 'f'],
    'number': [1, 2, 3, 4, 5, 6],
    'Age': [12, 24, 56, 33, 2, 22]
}
df1 = pd.DataFrame(data1, index=[1, 2, 3, 4, 5, 6])
print(df1)

print("\nSecond DataFrame:")
data2 = {
    'name': ['g', 'h', 'i', 'j', 'k', 'l'],
    'number': [32, 43, 22, 54, 12, 99],
    'Age': [34, 55, 90, 43, 11, 22]
}
df2 = pd.DataFrame(data2, index=[1, 2, 3, 4, 5, 6])
print(df2)

print("\nApplying Concat:")

df_combined = pd.concat([df1, df2], ignore_index=True)
print(df_combined)


First DataFrame:
  name  number  Age
1    a       1   12
2    b       2   24
3    c       3   56
4    d       4   33
5    e       5    2
6    f       6   22

Second DataFrame:
  name  number  Age
1    g      32   34
2    h      43   55
3    i      22   90
4    j      54   43
5    k      12   11
6    l      99   22

Applying Concat:
   name  number  Age
0     a       1   12
1     b       2   24
2     c       3   56
3     d       4   33
4     e       5    2
5     f       6   22
6     g      32   34
7     h      43   55
8     i      22   90
9     j      54   43
10    k      12   11
11    l      99   22


# **1.2:Pandas DataFrame.apply()**

DataFrame.apply() applies a function along an axis of a DataFrame: the function is called once per column (the default) or once per row, and each call receives that column or row as a Series. It is useful for transformations and reductions that are not available as built-in DataFrame methods.

**The syntax for using DataFrame.apply() is as follows:**

DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)

**Let's break down the parameters:**

**func**: This parameter specifies the function to apply to each column or row of the DataFrame.

**axis:** It determines the axis along which the function will be applied. By default, it is set to 0, which means the function is applied to each column (vertically). You can also set it to 1 to apply the function to each row (horizontally).

**raw:** If set to True, the function receives each column or row as a NumPy ndarray instead of a Series.

**result_type:** It only takes effect when axis=1. 'expand' turns list-like results into columns, 'reduce' returns a Series where possible, and 'broadcast' keeps the original shape, index, and columns. With the default value of None, the return type depends on what func returns: list-like results come back as a Series of those values, while a returned Series is expanded into columns.

**args and kwargs:** These are additional positional arguments and keyword arguments that are passed to func.

The function specified in func can be a built-in Python function, a NumPy function, a lambda function, or a custom-defined function. If it returns a single value for each column or row, the result is a Series; if it returns a Series, the result is a DataFrame.

**DataFrame.apply()** is particularly useful for column-wise or row-wise computations, such as reducing each column to a summary value, transforming data, or combining multiple columns to create new columns based on custom logic.

Because apply() calls func once per column or row, it is usually slower than the equivalent vectorized operation. For simple arithmetic, operate on the DataFrame directly (for example, df * 2), as some of the examples below do.
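The parameters described above can be sketched as follows. The helper `scaled_range` and its `factor` argument are hypothetical names introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

def scaled_range(values, factor):
    """Return (max - min) * factor for one column or row (hypothetical helper)."""
    return (values.max() - values.min()) * factor

# axis=0 (default): each column is passed as a Series; extra positional
# arguments for the function go in args
col_ranges = df.apply(scaled_range, args=(10,))

# axis=1: each row is passed as a Series; extra keyword arguments are
# forwarded to the function
row_ranges = df.apply(scaled_range, axis=1, factor=10)

# result_type='expand' turns list-like results into columns (axis=1 only)
expanded = df.apply(
    lambda row: [row["A"] + row["B"], row["A"] * row["B"]],
    axis=1,
    result_type="expand",
)

print(col_ranges)
print(row_ranges)
print(expanded)
```

Here each reduction returns one value per column or row, so `col_ranges` and `row_ranges` are Series, while `expanded` is a DataFrame with one column per list element.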

**Example 01: Summing Columns and Rows**

In [5]:
import pandas as pd
import numpy as np
df = pd.DataFrame([[7,4]] * 4 ,columns=['A','B'])
print(df)
print("Sum of Columns")
info=df.apply(np.sum, axis =0)
print(info)
print("Sum of Rows")
info1=df.apply(np.sum, axis = 1)
print(info1)

   A  B
0  7  4
1  7  4
2  7  4
3  7  4
Sum of Columns
A    28
B    16
dtype: int64
Sum of Rows
0    11
1    11
2    11
3    11
dtype: int64



**Example 02: Applying a Function to Each Column by Using Sum:**

In [6]:
import pandas as pd
import numpy as np
# Create a DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
print(df)
print("sum by lambda function by columns")
# Each column is passed to the lambda as a Series
sum2 = lambda x: np.sum(x, axis=0)
aa = df.apply(sum2)
print(aa)


   A  B
0  1  4
1  2  5
2  3  6
sum by lambda function by columns
A     6
B    15
dtype: int64


**Example 03: Applying a Function to Each Element by Using Multiply:**

In [7]:
import pandas as pd
import numpy as np
# Create a DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
print(df)
print('use lambda function')
# Define a lambda function that multiplies each element of a column by 2
multiple_by_2 = lambda x: np.multiply(x, 2)
# Apply the function to each column of the DataFrame
result = df.apply(multiple_by_2)
print(result)

   A  B
0  1  4
1  2  5
2  3  6
use lambda function
   A   B
0  2   8
1  4  10
2  6  12


In [8]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, 7, 8, 9, 10]})
info=df*2
info

    a   b
0   2  12
1   4  14
2   6  16
3   8  18
4  10  20


**Example 04: Applying a Function to Each Element by Using Divide:**

In [9]:
import pandas as pd
import numpy as np
# Create a DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Define a lambda function to divide each element by 2
divide_by_2 = lambda x: np.divide(x, 2)

# Apply the function to each column of the DataFrame
result = df.apply(divide_by_2)
print(result)

     A    B
0  0.5  2.0
1  1.0  2.5
2  1.5  3.0


In [10]:
import pandas as pd
import numpy as np

#Create a DataFrame
data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

result = data / 2
print(result)

     A    B
0  0.5  2.0
1  1.0  2.5
2  1.5  3.0


# **1.3:Pandas DataFrame.aggregate()**

**The DataFrame.aggregate()** method (alias: agg()) in pandas applies one or more aggregation functions along an axis of a DataFrame. It allows you to calculate summary statistics, such as sum, mean, maximum, and minimum, for specific columns or the entire DataFrame.

**The basic syntax for using aggregate() is:**

DataFrame.aggregate(func=None, axis=0, *args, **kwargs)

**func:** It can be a single aggregation function, a function name given as a string (such as 'sum'), a list of functions, or a dict that maps column labels to functions.

**axis:** It specifies the axis along which the aggregation is performed. By default, axis=0 applies the function to each column; axis=1 applies it to each row.

**args and kwargs:** Additional positional and keyword arguments that are passed to the aggregation function.
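As an illustration of the different forms func can take, here is a minimal sketch on a small made-up DataFrame. The column names and values are sample data chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, 35, 40],
    "income": [50000, 60000, 70000, 80000],
})

# A single function name returns one value per column (a Series)
totals = df.agg("sum")

# A dict maps each column label to its own list of functions;
# combinations that were not requested are filled with NaN
per_column = df.agg({"age": ["min", "max"], "income": ["mean"]})

# axis=1 aggregates across the columns of each row
row_totals = df.agg("sum", axis=1)

print(totals)
print(per_column)
print(row_totals)
```

The dict form is convenient when different columns need different statistics, since it avoids computing values that are never used.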

**Example 01**

In [11]:
import pandas as pd
import numpy as np
dat =pd.DataFrame([[1,2,3],[4,5,6],[545,66,223],[np.nan,np.nan,np.nan]], columns=['X','Y','Z'])
print(dat)
print('use aggregate() function')
Ag =dat.agg(['sum' ,'min'])
Ag

       X     Y      Z
0    1.0   2.0    3.0
1    4.0   5.0    6.0
2  545.0  66.0  223.0
3    NaN   NaN    NaN
use aggregate() function


         X     Y      Z
sum  550.0  73.0  232.0
min    1.0   2.0    3.0


In [13]:
import pandas as pd

# Create a sample DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'income': [50000, 60000, 70000, 80000]
}

df = pd.DataFrame(data)
print(df)

print("\nApplying aggregate() function on numeric columns:")
# Apply aggregate only on numeric columns
numeric_df = df.select_dtypes(include=['number'])  # Select only numeric columns
ag = numeric_df.agg(['mean', 'median', 'max', 'min', 'sum', 'std', 'var'])  
print(ag)


      name  age  income
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000

Applying aggregate() function on numeric columns:
               age        income
mean     32.500000  6.500000e+04
median   32.500000  6.500000e+04
max      40.000000  8.000000e+04
min      25.000000  5.000000e+04
sum     130.000000  2.600000e+05
std       6.454972  1.290994e+04
var      41.666667  1.666667e+08


In [14]:
import pandas as pd

# Create a sample DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],'age': [25, 30, 35, 40],'income': [50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)
print(df)
# Selecting the non-numeric 'name' column makes numeric aggregations such as 'mean' fail with a TypeError
aggg= df[['age','name']].agg(['mean', 'median','max','min','sum','std','var','corrcoef','ptp'])
aggg

      name  age  income
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000


TypeError: Could not convert string 'AliceBobCharlieDavid' to numeric

In [20]:
import pandas as pd
csv = pd.read_csv('data.csv')

csv.drop(columns='NAME', inplace=True)
csv

     ID  AMOUNT  PERCENTAGE
0  2394   23869          75
1  9350   16753          61
2  9855   17301          74
3  9453    6898          95
4  6650   14926          81
5  2789   30620          82
6  7933   11290          89
7  3115   27368          79
8  5919   10160          90
9  6890   22438          82


In [21]:
data = csv.agg(['min','max','mean','var','corrcoef','std','median','sum'])
data

                 ID      AMOUNT  PERCENTAGE
min          2131.0      3951.0        60.0
max          9947.0     38801.0        96.0
mean        6789.96    20184.62       79.08
var       6015180.0  93601750.0  106.605714
corrcoef        1.0         1.0         1.0
std        2452.586      9674.8   10.325004
median       7740.5     18879.5        79.5
sum        339498.0   1009231.0      3954.0


In [27]:
import pandas as pd
import numpy as np
df= pd.read_csv('data.csv')

df.drop(columns='NAME', inplace=True)

df

     ID  AMOUNT  PERCENTAGE
0  2394   23869          75
1  9350   16753          61
2  9855   17301          74
3  9453    6898          95
4  6650   14926          81
5  2789   30620          82
6  7933   11290          89
7  3115   27368          79
8  5919   10160          90
9  6890   22438          82


In [28]:
#use apply function
data =df.apply(np.sum, axis= 0)
data

ID             339498
AMOUNT        1009231
PERCENTAGE       3954
dtype: int64

In [29]:
da =df[['AMOUNT','PERCENTAGE']]*2
da

   AMOUNT  PERCENTAGE
0   47738         150
1   33506         122
2   34602         148
3   13796         190
4   29852         162
5   61240         164
6   22580         178
7   54736         158
8   20320         180
9   44876         164



In [33]:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({'Country': ['USA', 'Canada'], 'Age': [25, 30], 'Salary': [50000, 60000]})

# New DataFrame to append
data2 = pd.DataFrame({'Country': ['Dubai', 'Dubai', 'Jampur'], 'Age': [11, 12, 13], 'Salary': [2300, 2400, 2500]})

print("Append using concat")

# Combine both DataFrames row-wise with concat()
data_combined = pd.concat([df, data2], ignore_index=True)

print(data_combined)


Append using concat
  Country  Age  Salary
0     USA   25   50000
1  Canada   30   60000
2   Dubai   11    2300
3   Dubai   12    2400
4  Jampur   13    2500


# **1.4:Pandas DataFrame.assign()**

**The DataFrame.assign()** method in pandas is used to create new columns or overwrite existing columns in a DataFrame. It returns a new DataFrame and leaves the original unchanged; the assigned values can be constants, sequences, Series, or expressions computed from existing columns.

**The basic syntax for using assign() is:**

DataFrame.assign(**kwargs)

**kwargs**: Each keyword is a column name, and its value is the data to assign to that column: a scalar, a sequence, a Series, or a callable. A callable is evaluated on the DataFrame and its result is assigned. If a keyword matches an existing column name, that column is overwritten.
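Besides plain values, assign() also accepts callables as keyword values. The following is a minimal sketch of that form, using a made-up DataFrame; the lambda parameter name `d` is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# A callable receives the DataFrame being built, so a later keyword
# can use a column created by an earlier keyword in the same call
result = df.assign(
    C=lambda d: d["A"] + d["B"],
    D=lambda d: d["C"] * 2,
)

print(result)
print(list(df.columns))  # the original DataFrame is unchanged: ['A', 'B']
```

The callable form is handy in method chains, where the intermediate DataFrame has no variable name to refer to.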

**Example 01**

In [34]:
import numpy as np
import pandas as pd

de = {'a1':[1,2,3,4,5],'a2':[6,7,8,9,0]}
df0 =pd.DataFrame(de)
print(df0)
ass=df0.assign(a3 =[43,45,32,55,66] ,a4=[435,343,523,542,553] )
print(ass)
#ase = df0.assign(a4=[435,343,523,542,553])
#ase

   a1  a2
0   1   6
1   2   7
2   3   8
3   4   9
4   5   0
   a1  a2  a3   a4
0   1   6  43  435
1   2   7  45  343
2   3   8  32  523
3   4   9  55  542
4   5   0  66  553


In [35]:
# Store the sum of columns a1 and a2 in a new column a3
add=df0.assign(a3 = df0['a1']+df0['a2'])
print(add)

   a1  a2  a3
0   1   6   7
1   2   7   9
2   3   8  11
3   4   9  13
4   5   0   5


In [36]:
import pandas as pd
import numpy as np

csv = pd.read_csv ("Dta.csv")
csv

   Country   Age   Salary  Purchased
0   France  44.0  72000.0         No
1    Spain  27.0  48000.0        Yes
2  Germany  30.0  54000.0         No
3    Spain  38.0  61000.0         No
4  Germany  40.0      NaN        Yes
5   France  35.0  58000.0        Yes
6    Spain   NaN  52000.0         No
7   France  48.0  79000.0        Yes
8  Germany  50.0  83000.0         No
9   France  37.0  67000.0        Yes


In [37]:
# Add a column to the DataFrame loaded from the CSV file
aadd=csv.assign(percentage=[23,454,32,52,25,54,75,34,23,66])
aadd

   Country   Age   Salary  Purchased  percentage
0   France  44.0  72000.0         No          23
1    Spain  27.0  48000.0        Yes         454
2  Germany  30.0  54000.0         No          32
3    Spain  38.0  61000.0         No          52
4  Germany  40.0      NaN        Yes          25
5   France  35.0  58000.0        Yes          54
6    Spain   NaN  52000.0         No          75
7   France  48.0  79000.0        Yes          34
8  Germany  50.0  83000.0         No          23
9   France  37.0  67000.0        Yes          66


In [38]:
# Store the sum of two columns in a new column
addd= csv.assign(percentage = csv['Age']+ csv['Salary'])
addd

   Country   Age   Salary  Purchased  percentage
0   France  44.0  72000.0         No     72044.0
1    Spain  27.0  48000.0        Yes     48027.0
2  Germany  30.0  54000.0         No     54030.0
3    Spain  38.0  61000.0         No     61038.0
4  Germany  40.0      NaN        Yes         NaN
5   France  35.0  58000.0        Yes     58035.0
6    Spain   NaN  52000.0         No         NaN
7   France  48.0  79000.0        Yes     79048.0
8  Germany  50.0  83000.0         No     83050.0
9   France  37.0  67000.0        Yes     67037.0


**Example 1:**

In [39]:
import pandas as pd
data =pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print("View the dataframe")
print(data)
print("Assign a new column to dataframe called 'Name'")
data1= data.assign(Name = [8, 9, 10])
print(data1)

View the dataframe
   A  B
0  1  4
1  2  5
2  3  6
Assign a new column to dataframe called 'Name'
   A  B  Name
0  1  4     8
1  2  5     9
2  3  6    10


In [40]:
import pandas as pd
data =pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print("View the dataframe")
print(data)
print("Assign a new column to dataframe called 'Name'")
data['Name']=[3,4,5]
print(data)

View the dataframe
   A  B
0  1  4
1  2  5
2  3  6
Assign a new column to dataframe called 'Name'
   A  B  Name
0  1  4     3
1  2  5     4
2  3  6     5


**Example 02:Transforming an existing column:**

In [41]:
import pandas as pd
print("Create a DataFrame")
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
print(df)
print()
print("Multiply values in column 'A' by 2")
df_new = df.assign(A=df['A'] * 2)
print(df_new)

Create a DataFrame
   A  B
0  1  4
1  2  5
2  3  6

Multiply values in column 'A' by 2
   A  B
0  2  4
1  4  5
2  6  6


**Example 03: Creating a new column based on multiple existing columns:**

In [42]:
import pandas as pd
print("Create a DataFrame")
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
print(df)
print()
print("Create a new column 'C' by adding values in columns 'A' and 'B'")
df_new = df.assign(C=df['A'] + df['B'])
print(df_new)

Create a DataFrame
   A  B
0  1  4
1  2  5
2  3  6

Create a new column 'C' by adding values in columns 'A' and 'B'
   A  B  C
0  1  4  5
1  2  5  7
2  3  6  9


**Example 04: Creating multiple new columns simultaneously:**

In [43]:
import pandas as pd
print("Create a DataFrame")
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
print(df)
print()
print("Create new columns 'C' and 'D' simultaneously")
df_new = df.assign(C=df['A'] + df['B'], D=df['A'] * df['B'])
print(df_new)

Create a DataFrame
   A  B
0  1  4
1  2  5
2  3  6

Create new columns 'C' and 'D' simultaneously
   A  B  C   D
0  1  4  5   4
1  2  5  7  10
2  3  6  9  18


**Example 05: Chaining multiple assign statements:**

In [44]:
import pandas as pd
print("Create a DataFrame")
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
print(df)
print()
print(" Chain multiple assign statements to create new columns")
df_new = df.assign(C=10, D=df['A'] + df['B'] , E=df['A'] * 2, F=df['A']/df['B'])
print(df_new)
print()
print("OR")
df_new = df.assign(C=10).assign(D=df['A'] + df['B']).assign(E=df['A'] * 2).assign(F=df['A']/df['B'])
print(df_new)


Create a DataFrame
   A  B
0  1  4
1  2  5
2  3  6

 Chain multiple assign statements to create new columns
   A  B   C  D  E     F
0  1  4  10  5  2  0.25
1  2  5  10  7  4  0.40
2  3  6  10  9  6  0.50

OR
   A  B   C  D  E     F
0  1  4  10  5  2  0.25
1  2  5  10  7  4  0.40
2  3  6  10  9  6  0.50


# **1.5:Pandas DataFrame.astype()**
The **astype()** method in Pandas DataFrame is used to change the data type of one or more columns in a DataFrame. It allows you to explicitly convert the data type of the values within the DataFrame to a specified type.

**The syntax for using the astype() method is as follows:**

DataFrame.astype(dtype, copy=True, errors='raise')

**The parameters for the astype() method are:**

**dtype:** This parameter specifies the data type to which the column(s) should be converted. It can be a data type name (e.g., 'int', 'float', 'str'), a NumPy data type object, or a dict that maps column names to data types so that only those columns are converted.

**copy (optional):** By default, it is set to True, which returns a new copy of the data. astype() never modifies the original DataFrame in place; with copy=False, pandas may skip the copy when no conversion is needed, in which case changes to the result can propagate to the original.

**errors (optional):** This parameter determines how the method handles errors during the conversion. By default, it is set to 'raise', which raises an exception if there is an error. The only other option is 'ignore', which suppresses the exception and returns the original object unchanged. astype() has no 'coerce' option; to replace invalid values with NaN, use pd.to_numeric(..., errors='coerce') instead.
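The sketch below shows two related conversions on made-up data: converting only selected columns by passing a dict to astype(), and turning unparseable values into NaN with pd.to_numeric(), which is the function that offers errors='coerce'. The column names are sample names chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({"id": ["1", "2", "3"], "score": ["4.5", "x", "6.0"]})

# A dict converts only the listed columns; 'score' is left as it is
typed = df.astype({"id": "int64"})

# pd.to_numeric with errors='coerce' replaces unparseable values with NaN
scores = pd.to_numeric(df["score"], errors="coerce")

print(typed.dtypes)
print(scores)
```

Coercing and then inspecting the NaN positions is one way to locate the rows that blocked a clean conversion.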

In [45]:
import pandas as pd
df = pd.DataFrame({'A':[1.2,2.3,4.5],'B':[8.4,9.4,3.2]})
print(df)
gf = df.astype(int)
print(gf)

     A    B
0  1.2  8.4
1  2.3  9.4
2  4.5  3.2
   A  B
0  1  8
1  2  9
2  4  3


In [46]:
import pandas as pd
df = pd.DataFrame({'A':[1.2,2.3,4.5],'B':[8.4,9.4,3.2]})
print(df)
gf = df.astype(complex)
print(gf)

     A    B
0  1.2  8.4
1  2.3  9.4
2  4.5  3.2
          A         B
0  1.2+0.0j  8.4+0.0j
1  2.3+0.0j  9.4+0.0j
2  4.5+0.0j  3.2+0.0j


In [47]:
import pandas as pd
print("Create a DataFrame")
data = {'A': [1, 2, 3, 4]}
df = pd.DataFrame(data)
print(df)
print()
print("Convert column 'A' to float")
df = df.astype(float)
print(df)

Create a DataFrame
   A
0  1
1  2
2  3
3  4

Convert column 'A' to float
     A
0  1.0
1  2.0
2  3.0
3  4.0


In [48]:
import pandas as pd

# Create a DataFrame
data = {'A': [1, 2, 3, 4]}
df = pd.DataFrame(data)

# Convert column 'A' to string
df = df.astype(str)
print(df)

   A
0  1
1  2
2  3
3  4


In [49]:
import pandas as pd
# Create a DataFrame
data = {'A': ['1', '2', '3', 'x']}
df = pd.DataFrame(data)
# Try to convert column 'A' to integer; 'x' cannot be converted, so errors='ignore' returns the original strings
df = df.astype(int, errors='ignore')
print(df)

   A
0  1
1  2
2  3
3  x


# **1.6:Pandas DataFrame.count()**

The **count()** method in Pandas DataFrame is used to count the number of non-null values in each column of a DataFrame. It returns a Series object that contains the counts for each column.

**The syntax for using the count() method is as follows:**

DataFrame.count(axis=0, numeric_only=False)

**The parameters for the count() method are:**

**axis (optional):** This parameter specifies the axis along which the count operation is performed. By default, it is set to 0, which counts the non-null values in each column. If set to 1, it counts the non-null values in each row.

**level (optional, pandas versions before 2.0 only):** If the DataFrame has hierarchical indexing (MultiIndex), this parameter specified the level at which the count operation was performed. It was deprecated in pandas 1.3 and removed in pandas 2.0; use groupby(level=...).count() instead.

**numeric_only (optional):** By default, it is set to False, which includes all columns for counting. If set to True, only columns with float, int, or boolean data are counted.
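A minimal sketch of the numeric_only option, together with a common way of deriving the number of missing values from count(). The DataFrame is made-up sample data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Emma", None],
    "age": [25, np.nan, 28],
})

all_counts = df.count()                       # non-null values in every column
numeric_counts = df.count(numeric_only=True)  # numeric columns only
missing = len(df) - df.count()                # missing values per column

print(all_counts)
print(numeric_counts)
print(missing)
```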

**Example**

In [50]:
import numpy as np
import pandas as pd
A =pd.DataFrame({'A' : [12,34,345,12,4,321]})
print(A)
count =A.count()
count

     A
0   12
1   34
2  345
3   12
4    4
5  321


A    6
dtype: int64

In [53]:
import pandas as pd
A = pd.read_csv('Dta.csv')
A

   Country   Age   Salary  Purchased
0   France  44.0  72000.0         No
1    Spain  27.0  48000.0        Yes
2  Germany  30.0  54000.0         No
3    Spain  38.0  61000.0         No
4  Germany  40.0      NaN        Yes
5   France  35.0  58000.0        Yes
6    Spain   NaN  52000.0         No
7   France  48.0  79000.0        Yes
8  Germany  50.0  83000.0         No
9   France  37.0  67000.0        Yes


In [54]:
C = A.count()
C

Country      10
Age           9
Salary        9
Purchased    10
dtype: int64

In [55]:
C = A.count(axis=1)
C

0    4
1    4
2    4
3    4
4    3
5    4
6    3
7    4
8    4
9    4
dtype: int64

In [56]:
import pandas as pd
print("Creating a sample DataFrame")
data = {'name':['John', 'Jane', 'Mark', 'Mary', 'Alex'],'age': [25, 30, 20, 35, 28],'gender':['Male', 'Female', 'Male', 'Female', 'Male']}
df = pd.DataFrame(data)
print(df)
print(" Counting non-null values in each column")
counts = df.count()
print(counts)

Creating a sample DataFrame
   name  age  gender
0  John   25    Male
1  Jane   30  Female
2  Mark   20    Male
3  Mary   35  Female
4  Alex   28    Male
 Counting non-null values in each column
name      5
age       5
gender    5
dtype: int64


In [57]:
import pandas as pd

# Create a DataFrame with missing values
data = {'Name': ['John', 'Emma', None],
        'Age': [25, None, 28],
        'Height': [1.75, 1.68, None]}
df = pd.DataFrame(data)
print(df)
print()
# Count non-null values
counts = df.count()
print(counts)

   Name   Age  Height
0  John  25.0    1.75
1  Emma   NaN    1.68
2  None  28.0     NaN

Name      2
Age       2
Height    2
dtype: int64


In [58]:
import pandas as pd
# Create a DataFrame with missing values
data = {'Name': ['John', 'Emma', None],'Age': [25, None, 28],'Height': [1.75, 1.68, None]}
df = pd.DataFrame(data)
print(df)
print()
row_counts = df.count(axis=1)
print("Display the row counts")
print(row_counts)
print()
column_counts = df.count(axis=0)
print("Display the column counts")
print(column_counts)

   Name   Age  Height
0  John  25.0    1.75
1  Emma   NaN    1.68
2  None  28.0     NaN

Display the row counts
0    3
1    2
2    1
dtype: int64

Display the column counts
Name      2
Age       2
Height    2
dtype: int64


# **1.7:Pandas cut()**
pandas.cut() is a top-level pandas function rather than a DataFrame method; it is typically applied to a single DataFrame column (a Series).

The cut() function divides a continuous variable into discrete intervals, or bins. It is often used to convert continuous data into categorical data, making it easier to analyze and interpret.

**The syntax for using the cut() function in Pandas is as follows:**

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)

**Here is an explanation of the most commonly used parameters:**

**x:** This is the one-dimensional input array or Series containing the values to be binned.

**bins:** This can be an integer, specifying the number of equal-width bins to create, or a sequence of bin edges, defining the bin ranges. Values that fall outside the bin edges become NaN.

**labels:** This parameter is optional. It can be used to provide custom labels for the resulting bins; the number of labels must equal the number of bins. If not specified, each bin is labeled with its interval, such as (0, 60]. If set to False, integer bin codes are returned instead.

**right:** This is a boolean parameter indicating whether the intervals include their right edge. With right=True (the default), the edges [0, 60, 80] produce the bins (0, 60] and (60, 80]; with right=False they produce [0, 60) and [60, 80).

**include_lowest:** This is a boolean parameter indicating whether the first interval should also include its left edge. By default it is False, so with right=True a value equal to the lowest edge is not assigned to any bin.

**duplicates:** This parameter specifies how to handle duplicate bin edges if they exist. It can take the values 'raise' (the default) or 'drop'.

The cut() function returns a categorical object (a Series of categorical dtype when x is a Series), which can be assigned to a column in a DataFrame or used for further analysis.
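The edge-handling parameters are easiest to see on values that sit exactly on the bin edges. This is a minimal sketch with made-up scores; the variable names are chosen for illustration.

```python
import pandas as pd

scores = pd.Series([0, 60, 80, 100])
edges = [0, 60, 80, 100]

# Default right=True: bins are (0, 60], (60, 80], (80, 100]; 0 is not binned
default = pd.cut(scores, bins=edges)

# include_lowest=True also places the lowest edge (0) in the first bin
with_lowest = pd.cut(scores, bins=edges, include_lowest=True)

# right=False: bins are [0, 60), [60, 80), [80, 100); 100 is not binned
left_closed = pd.cut(scores, bins=edges, right=False)

# labels=False returns integer bin codes instead of interval labels
codes = pd.cut(scores, bins=edges, labels=False, include_lowest=True)

print(default.isna().tolist())      # [True, False, False, False]
print(left_closed.isna().tolist())  # [False, False, False, True]
print(codes.tolist())               # [0, 0, 1, 2]
```

Checking for NaN after binning is a quick way to confirm that the chosen edges cover the whole range of the data.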

**Example 1:** Categorizing Exam Scores

Suppose you have a DataFrame df with a column 'Score' containing exam scores. You want to categorize the scores into three bins: 'Low', 'Medium', and 'High'. You can use the cut() function as follows:

In [59]:
import pandas as pd
df = pd.DataFrame({'Score': [85, 60, 75, 90, 95, 50, 70]})
bins = [0, 60, 80, 100]
labels = ['Low', 'Medium', 'High']

df['Score_Category']= pd.cut(df['Score'], bins=bins, labels=labels)

print(df)

   Score Score_Category
0     85           High
1     60            Low
2     75         Medium
3     90           High
4     95           High
5     50            Low
6     70         Medium


In [60]:
import pandas as pd
df = pd.DataFrame({'Age': [12, 45, 67, 34,13, 19, 52, 8]})
bins = [0, 12, 18, 60, 100]
labels = ['Child', 'Teen', 'Adult', 'Senior']
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels)

print(df)

   Age Age_Group
0   12     Child
1   45     Adult
2   67    Senior
3   34     Adult
4   13      Teen
5   19     Adult
6   52     Adult
7    8     Child


In [62]:
import pandas as pd

# Define DataFrame
df = pd.DataFrame({'Age':[32,53,65,84,32,43,23,76,45,67]})

# Define bins and labels
bins = [0, 14, 19, 60, 80, 100]  # 5 bins
labels = ['Child', 'Teen', 'Adult', 'Senior', 'Elderly']  # 5 labels

# Apply pd.cut()
df['group_age'] = pd.cut(df['Age'], bins=bins, labels=labels)

# Print DataFrame
print(df)


   Age group_age
0   32     Adult
1   53     Adult
2   65    Senior
3   84   Elderly
4   32     Adult
5   43     Adult
6   23     Adult
7   76    Senior
8   45     Adult
9   67    Senior


In [63]:
import pandas as pd

df = pd.DataFrame({'Age': [32, 53, 65, 84, 32, 43, 23, 76, 45, 67]})
bins = [0, 14, 19, 60, 80, 100]
bins = sorted(bins)
labels = ['child', 'Teen', 'Adult', 'Senior','oldest']

df['group_age'] = pd.cut(df['Age'], bins=bins, labels=labels)

print(df)

   Age group_age
0   32     Adult
1   53     Adult
2   65    Senior
3   84    oldest
4   32     Adult
5   43     Adult
6   23     Adult
7   76    Senior
8   45     Adult
9   67    Senior


In [64]:
import pandas as pd

df = pd.DataFrame({'Age': [32, 53, 65, 84, 32, 43, 23, 76, 45, 67]})

bins = [0, 14, 19, 60, 80, 100]

# Sort bins
bins = sorted(bins)

# Add one more label
labels = ['child', 'Teen', 'Adult', 'Senior', 'Oldest']

df['group_age'] = pd.cut(df['Age'], bins=bins, labels=labels)

print(df)

   Age group_age
0   32     Adult
1   53     Adult
2   65    Senior
3   84    Oldest
4   32     Adult
5   43     Adult
6   23     Adult
7   76    Senior
8   45     Adult
9   67    Senior


# **1.8:Pandas DataFrame.describe()**

The DataFrame.describe() method generates descriptive statistics of the columns in a DataFrame. The same method is available on a Series.

**The syntax for using the DataFrame.describe() method is as follows:**

DataFrame.describe(percentiles=None, include=None, exclude=None)

**Here is an explanation of the parameters:**

**percentiles:** This parameter is optional and allows you to specify which percentiles to include in the output, as a list of numbers between 0 and 1. By default, it includes the 25th, 50th, and 75th percentiles ([0.25, 0.5, 0.75]).

**include:** This parameter is optional and allows you to specify the data types to include in the output. You can pass a list of data types or use 'all' to include all columns.

**exclude:** This parameter is optional and allows you to specify the data types to exclude from the output. You can pass a list of data types to exclude from the result.

When you call the DataFrame.describe() method, it computes summary statistics for each numerical column in the DataFrame: count, mean, standard deviation, minimum, quartiles, and maximum. On a DataFrame with mixed types, only the numerical columns are summarized by default. For object (string) and categorical columns, which are summarized when selected with include or when no numerical columns exist, it provides count, number of unique values, top value, and frequency of the top value.

The DataFrame.describe() method returns a new DataFrame that contains the descriptive statistics for the columns. It is useful for gaining insights into the distribution and summary of your data.
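The effect of include and exclude on a mixed-type DataFrame can be sketched as follows. The data is made up for illustration, and np.number is used to refer to all numeric dtypes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Emma", "Emma", "Sophia"],
    "Age": [25, 30, 30, 28],
})

numeric_only = df.describe()                   # default: only 'Age'
text_only = df.describe(exclude=[np.number])   # only 'Name'
everything = df.describe(include="all")        # both; NaN where a statistic does not apply

print(numeric_only)
print(text_only)
print(everything)
```

With include='all', rows such as mean are NaN for the text column and rows such as top are NaN for the numeric column.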

In [65]:
import pandas as pd
A =pd.Series([1,2,3,4,5,6,7,8,9])
print(A)
a1=A.describe()
a1

0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
dtype: int64


count    9.000000
mean     5.000000
std      2.738613
min      1.000000
25%      3.000000
50%      5.000000
75%      7.000000
max      9.000000
dtype: float64

In [66]:
import pandas as pd
a1 = pd.Series(['p', 'q', 'q', 'r'])
print(a1)
a2= a1.describe()
print(a2)

0    p
1    q
2    q
3    r
dtype: object
count     4
unique    3
top       q
freq      2
dtype: object


In [67]:
import pandas as pd

# Create a DataFrame
data = {'Name': ['John', 'Emma', 'Michael', 'Sophia'], 'Age': [25, 30, 35, 28], 'Salary': [50000, 60000, 70000, 55000]}
df = pd.DataFrame(data)
print(df)
print()

# Use describe() method with percentiles
description = df.describe(percentiles=[0.1, 0.5, 0.9])

print(description)


      Name  Age  Salary
0     John   25   50000
1     Emma   30   60000
2  Michael   35   70000
3   Sophia   28   55000

             Age        Salary
count   4.000000      4.000000
mean   29.500000  58750.000000
std     4.203173   8539.125638
min    25.000000  50000.000000
10%    25.900000  51500.000000
50%    29.000000  57500.000000
90%    33.500000  67000.000000
max    35.000000  70000.000000


In [70]:
import pandas as pd
df = pd.read_csv('Dta.csv')
df

   Country   Age   Salary  Purchased
0   France  44.0  72000.0         No
1    Spain  27.0  48000.0        Yes
2  Germany  30.0  54000.0         No
3    Spain  38.0  61000.0         No
4  Germany  40.0      NaN        Yes
5   France  35.0  58000.0        Yes
6    Spain   NaN  52000.0         No
7   France  48.0  79000.0        Yes
8  Germany  50.0  83000.0         No
9   France  37.0  67000.0        Yes


In [71]:
df1=df.describe()
df1

             Age        Salary
count   9.000000      9.000000
mean   38.777778  63777.777778
std     7.693793  12265.579662
min    27.000000  48000.000000
25%    35.000000  54000.000000
50%    38.000000  61000.000000
75%    44.000000  72000.000000
max    50.000000  83000.000000


# **1.9:Pandas DataFrame.drop_duplicates()**

The **drop_duplicates()** function is a method provided by the pandas library in Python for working with DataFrames. It is used to remove duplicate rows from a DataFrame, returning a new DataFrame with the duplicates removed.

**The syntax for using the drop_duplicates() function is as follows:**

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

**Let's break down the different parameters:**

**subset (optional):** Specifies the column(s) or label(s) to consider when identifying duplicates. By default, it considers all columns.

**keep (optional):** Specifies which occurrence of a duplicate to keep. It accepts three possible values:

**'first' (default):** Keeps the first occurrence of each duplicated row and removes subsequent occurrences.

**'last':** Keeps the last occurrence of each duplicated row and removes previous occurrences.

**False:** Removes all occurrences of duplicated rows.

**inplace (optional):** Determines whether to modify the original DataFrame or return a new one with duplicates dropped. By default, it is set to False, so a new DataFrame is returned and the original is left unchanged.
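The notebook cells that follow only call drop_duplicates() with its defaults, so here is a minimal sketch of how `subset` and `keep` change the result. The `orders` DataFrame and its column names are hypothetical and were invented for this illustration.

```python
import pandas as pd

# Hypothetical sample data, not taken from the original notebook
orders = pd.DataFrame({
    "customer": ["Ann", "Ben", "Ann", "Cara", "Ben"],
    "item": ["pen", "ink", "pen", "pad", "cap"],
})

# All columns are compared: only row 2 is an exact copy of row 0
exact = orders.drop_duplicates()

# Only 'customer' is compared, keeping the last row seen for each customer
last_per_customer = orders.drop_duplicates(subset="customer", keep="last")

# keep=False drops every customer that appears more than once
unique_customers = orders.drop_duplicates(subset="customer", keep=False)

print(exact)
print(last_per_customer)
print(unique_customers)
```

Note that the surviving rows keep their original index labels; calling reset_index(drop=True) afterwards renumbers them if that is preferred.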

In [74]:
import pandas as pd
df = pd.read_csv('Dta.csv')
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [75]:
A= df.drop_duplicates()
A

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [76]:
A = {'A': [3, 54, 2, 3, 2, 54, 77, 77, 44, 88, 88]}
B = pd.DataFrame(A)
C = B.drop_duplicates()
C

Unnamed: 0,A
0,3
1,54
2,2
6,77
8,44
9,88


In [77]:
import pandas as pd
data = {"Name": ["Parker", "Smith", "William", "Parker"], "Age": [21, 32, 29, 21]}
info = pd.DataFrame(data)
print(info)
print()
print(" Removing duplicate values from the DataFrame")
info = info.drop_duplicates()
print(info)

      Name  Age
0   Parker   21
1    Smith   32
2  William   29
3   Parker   21

 Removing duplicate values from the DataFrame
      Name  Age
0   Parker   21
1    Smith   32
2  William   29


# **1.10:Pandas DataFrame.groupby()**

The **groupby()** function is a method provided by the pandas library in Python for grouping and aggregating (combining) data in a DataFrame. It allows you to split the data into groups based on one or more columns and perform operations on each group.
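As a rough sketch of grouping on one column versus several columns, the following example sums a value per group. The `sales` DataFrame and its column names are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North"],
    "product": ["A", "B", "A", "A", "A"],
    "units": [10, 5, 7, 3, 2],
})

# One grouping column: one result per region
by_region = sales.groupby("region")["units"].sum()

# Two grouping columns: one result per (region, product) pair
by_region_product = sales.groupby(["region", "product"])["units"].sum()

print(by_region)
print(by_region_product)
```

When a list of columns is passed, the result is indexed by a MultiIndex built from those columns.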

**The syntax for using the groupby() function is as follows:**

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)

**Let's break down the different parameters:**

**by:** Specifies the column(s) or label(s) to group by. It can be a single column name or a list of column names.

**axis:** Specifies the axis to group along. By default, it is set to 0, which groups along the rows. You can set it to 1 to group along the columns.

**level:** Specifies the level(s) (hierarchical index) to group by in case of a MultiIndex DataFrame.

**as_index:** Determines whether to use the grouping columns as the index of the resulting DataFrame. By default, it is set to True.

**sort:** Determines whether to sort the resulting DataFrame by the grouping columns. By default, it is set to True.

**group_keys:** Applies when calling apply() on the groups. Determines whether the group keys are added to the index of the result to identify which group each piece came from. By default, it is set to True.

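**squeeze:** Determines whether to reduce the dimensionality of the returned object if possible. By default, it is set to False. This parameter exists only in older pandas releases: it was deprecated in pandas 1.1 and removed in pandas 2.0.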

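**observed:** Applies only when a grouping column is categorical. If True, only the category values that actually appear in the data are shown; if False, all category values are shown, including those with no rows. In the signature above, it is set to False.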
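To make two of these parameters more concrete, here is a small sketch comparing the default behaviour with `as_index=False` and `sort=False`. The `scores` DataFrame is hypothetical and is not part of the original notebook.

```python
import pandas as pd

# Hypothetical data, invented for illustration
scores = pd.DataFrame({
    "team": ["red", "blue", "red", "blue"],
    "points": [3, 1, 4, 5],
})

# Default: group labels become the index and are sorted
default = scores.groupby("team")["points"].sum()

# as_index=False: group labels stay as an ordinary column
flat = scores.groupby("team", as_index=False)["points"].sum()

# sort=False: groups appear in order of first appearance
unsorted_result = scores.groupby("team", sort=False)["points"].sum()

print(default)
print(flat)
print(unsorted_result)
```

Using `as_index=False` can be convenient when the result will be merged or exported, because the group labels remain a regular column.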

In [78]:
# import the pandas library
import pandas as pd
import numpy as np
data = {'Name': ['Parker', 'Smith', 'John', 'William'],  'Percentage': [82, 98, 91, 87], 'Course': ['B.Sc','B.Ed','M.Phill','BA']}
df = pd.DataFrame(data)
print(df)
print()
print("groupby object is created")
grp = df.groupby('Course')
print(grp)

      Name  Percentage   Course
0   Parker          82     B.Sc
1    Smith          98     B.Ed
2     John          91  M.Phill
3  William          87       BA

groupby object is created
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001A0CC4D9090>


In [79]:
grp['Percentage'].agg('mean')


Course
B.Ed       98.0
B.Sc       82.0
BA         87.0
M.Phill    91.0
Name: Percentage, dtype: float64

In [80]:
import pandas as pd
import numpy as np
data = {'Name': ['Parker', 'Smith', 'John', 'William'],
   'Percentage': [82, 98, 91, 87],
   'Course': ['B.Sc','B.Ed','M.Phill','BA']}
df = pd.DataFrame(data)
grouped = df.groupby('Course')
print(grouped['Percentage'].mean())


Course
B.Ed       98.0
B.Sc       82.0
BA         87.0
M.Phill    91.0
Name: Percentage, dtype: float64


In [81]:
import pandas as pd

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],'Age': [25, 30, 35, 40, 45],'Salary': [50000, 60000, 70000, 80000, 90000]}
df = pd.DataFrame(data)
# Group by the 'Name' column and calculate the average salary
group = df.groupby('Name')
avg = group['Salary'].mean()
avg

Name
Alice      65000.0
Bob        75000.0
Charlie    70000.0
Name: Salary, dtype: float64

**groupby() Function**

In [82]:
import pandas as pd
data =pd.DataFrame({'Name':['a','b','c','b','c','d','a','d'],'S_1':[2,3,4,5,6,7,8,9],'S_2':[4,6,8,2,9,10,66,55]})
print(data)

  Name  S_1  S_2
0    a    2    4
1    b    3    6
2    c    4    8
3    b    5    2
4    c    6    9
5    d    7   10
6    a    8   66
7    d    9   55


In [83]:
gr = data.groupby('Name')
gr

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001A0CC4E7F90>

In [84]:
for x,y in gr:
  print(x)
  print(y)
  print()

a
  Name  S_1  S_2
0    a    2    4
6    a    8   66

b
  Name  S_1  S_2
1    b    3    6
3    b    5    2

c
  Name  S_1  S_2
2    c    4    8
4    c    6    9

d
  Name  S_1  S_2
5    d    7   10
7    d    9   55



In [85]:
gr.get_group('c')

Unnamed: 0,Name,S_1,S_2
2,c,4,8
4,c,6,9


In [86]:
gr.get_group('b')

Unnamed: 0,Name,S_1,S_2
1,b,3,6
3,b,5,2


In [87]:
gr.max()

Name,S_1,S_2
a,8,66
b,5,6
c,6,9
d,9,55


In [88]:
gr.min()

Name,S_1,S_2
a,2,4
b,3,2
c,4,8
d,7,10


In [89]:
gr.sum()

Name,S_1,S_2
a,10,70
b,8,8
c,10,17
d,16,65


In [90]:
gr.mean()

Name,S_1,S_2
a,5.0,35.0
b,4.0,4.0
c,5.0,8.5
d,8.0,32.5


In [91]:
gr.var()

Name,S_1,S_2
a,18.0,1922.0
b,2.0,8.0
c,2.0,0.5
d,2.0,1012.5


In [92]:
gr.median()

Name,S_1,S_2
a,5.0,35.0
b,4.0,4.0
c,5.0,8.5
d,8.0,32.5


**head() Function**

In [97]:
import pandas as pd
df = pd.read_csv('Dta.csv')
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [98]:
df.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


In [99]:
df.head(6)

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes


**tail() Function**

In [100]:
df.tail()

Unnamed: 0,Country,Age,Salary,Purchased
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [101]:
df.tail(7)

Unnamed: 0,Country,Age,Salary,Purchased
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


**df.info()**

In [102]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    10 non-null     object 
 1   Age        9 non-null      float64
 2   Salary     9 non-null      float64
 3   Purchased  10 non-null     object 
dtypes: float64(2), object(2)
memory usage: 452.0+ bytes


**isnull() and notnull() Functions**

In [103]:
df.isnull()

Unnamed: 0,Country,Age,Salary,Purchased
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,True,False
5,False,False,False,False
6,False,True,False,False
7,False,False,False,False
8,False,False,False,False
9,False,False,False,False


In [104]:
df.notnull()

Unnamed: 0,Country,Age,Salary,Purchased
0,True,True,True,True
1,True,True,True,True
2,True,True,True,True
3,True,True,True,True
4,True,True,False,True
5,True,True,True,True
6,True,False,True,True
7,True,True,True,True
8,True,True,True,True
9,True,True,True,True


**df.isnull().sum()**

In [105]:
df.isnull().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

## Follow my GitHub account: https://github.com/ZeshanFareed