# Data Handling

In [1]:
from tqdm import tqdm
import pandas as pd
import requests
import numpy as np
from bs4 import BeautifulSoup
import time
import random
import csv

In [2]:
df = pd.read_csv('Data Frame (After Setp 1).csv')
df2 = df.copy()
df2

Unnamed: 0,Name,Country,English Level,Price,Diploma,Certificate,Response Time,No Of Lessons,Stars,Reviews
0,James W.,United States of America,A2,126.0,,,Usually responds in 5 hrs,6473,4.9,65.0
1,Taylor T.,United States of America,A1,52.0,,,Usually responds in more than a day,560,5.0,8.0
2,Desmond A.,Ghana,Native,45.0,Diploma verified,,Usually responds in 1 hour,10328,4.8,117.0
3,Joy L.,United States of America,Native,74.0,Diploma verified,Certificate verified,Usually responds in 1 hour,423,5.0,6.0
4,Noelle S.,United States of America,Native,37.0,,Certificate verified,Usually responds in 9 hrs,2,,
...,...,...,...,...,...,...,...,...,...,...
15178,Dewald Jaco D.,South Africa,Native,19.0,,Certificate verified,,,,
15179,Sunday B.,Nigeria,Native,11.0,,,,,,
15180,Ernestas P.,Lithuania,C2,69.0,Diploma verified,,,,,
15181,Ana M.,Brazil,C2,56.0,,,,,,


In [3]:
df2.duplicated().sum()

3621

A first check of the dataframe shows 3621 rows that exactly duplicate an earlier row. These redundant rows would distort counts and averages in later steps, so we remove them before any further processing.

In [4]:
df2 = df2.drop_duplicates()
df2.duplicated().sum()

0
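
To make the behaviour of `duplicated()` concrete, here is a minimal sketch on a toy frame. The rows and the names `toy`, `n_duplicates`, and `deduped` are illustrative assumptions and are not taken from the dataset. By default a row is flagged only when an earlier row matches it in every column.

```python
import pandas as pd

# Toy data (hypothetical): row 2 repeats row 0 exactly, row 3 differs in Price.
toy = pd.DataFrame({
    "Name": ["Ana", "Ben", "Ana", "Ana"],
    "Price": [10.0, 12.0, 10.0, 11.0],
})

# duplicated() marks a row only if an earlier row is identical in every column.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence and preserves the original index labels.
deduped = toy.drop_duplicates()

print(n_duplicates)
print(list(deduped.index))
```

Because the original index labels are kept, the index of the deduplicated frame has gaps. If a contiguous index is needed, `reset_index(drop=True)` can be applied afterwards.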

In [5]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11562 entries, 0 to 15177
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Name           11561 non-null  object 
 1   Country        11561 non-null  object 
 2   English Level  11561 non-null  object 
 3   Price          11561 non-null  float64
 4   Diploma        5025 non-null   object 
 5   Certificate    7441 non-null   object 
 6   Response Time  8636 non-null   object 
 7   No Of Lessons  8932 non-null   object 
 8   Stars          7308 non-null   float64
 9   Reviews        7308 non-null   float64
dtypes: float64(3), object(7)
memory usage: 993.6+ KB


The "Diploma" and "Certificate" columns hold either a verification label ('Diploma verified', 'Certificate verified') or no value at all. We recode both columns as flags: 'Yes' when the teacher has the verified qualification and 'No' otherwise.

In [6]:
df2['Diploma'] = df2['Diploma'].replace(np.nan,'No',inplace=False)
df2['Diploma'] = df2['Diploma'].replace('Diploma verified','Yes',inplace=False)
df2['Certificate'] = df2['Certificate'].replace(np.nan,'No',inplace=False)
df2['Certificate'] = df2['Certificate'].replace('Certificate verified','Yes',inplace=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2['Diploma'] = df2['Diploma'].replace(np.nan,'No',inplace=False)
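
The warning above is usually raised when columns are assigned on a frame that pandas still tracks as derived from another frame. The following is a minimal sketch, on a hypothetical toy frame named `raw`, of one common way to avoid it: take an explicit `.copy()` before assigning. It also shows an alternative way to build the Yes/No flags with `notna()`. It is not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical rows) with the two label values used in the source columns.
raw = pd.DataFrame({
    "Diploma": ["Diploma verified", np.nan, np.nan],
    "Certificate": [np.nan, "Certificate verified", np.nan],
})

# An explicit copy makes `flags` independent of `raw`, so later column
# assignments do not trigger SettingWithCopyWarning.
flags = raw.drop_duplicates().copy()

for col in ["Diploma", "Certificate"]:
    # A non-missing entry means the qualification is verified.
    flags[col] = np.where(flags[col].notna(), "Yes", "No")

print(flags)
```

This relies on the assumption that every non-missing entry in these columns is a verification label, which matches the values shown in the preview above.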

The "Response Time" column mixes labels that contain a number of hours (for example 'Usually responds in 5 hrs') with purely textual labels. We map 'Usually responds in less than an hour' to 1 and 'Usually responds in more than a day' to 24, fill missing values with 0, and then extract the number of hours from every label.

In [7]:
df2['Response Time'] = df2['Response Time'].replace('Usually responds in less than an hour','1',inplace=False)
df2['Response Time'] = df2['Response Time'].replace('Usually responds in more than a day','24',inplace=False)
df2['Response Time'] = df2['Response Time'].replace(np.nan,'0',inplace=False)
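# expand=False returns a Series (one capture group) instead of a one-column DataFrame.
df2['Response Time'] = df2['Response Time'].str.extract(r'(\d+)', expand=False)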

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2['Response Time'] = df2['Response Time'].replace('Usually responds in less than an hour','1',inplace=False)
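
The same mapping can be written as a single function, which makes the rules easier to read and test. This is a sketch only: `response_hours` is a hypothetical helper, not part of the notebook, and it reproduces the rules of the cell above (1 for "less than an hour", 24 for "more than a day", 0 for a missing label).

```python
import re

import pandas as pd


def response_hours(label) -> int:
    """Convert a response-time label to whole hours (hypothetical helper)."""
    if not isinstance(label, str):
        return 0  # missing label (None/NaN): same placeholder as in the cell above
    if "less than an hour" in label:
        return 1
    if "more than a day" in label:
        return 24
    match = re.search(r"\d+", label)
    return int(match.group()) if match else 0


labels = pd.Series([
    "Usually responds in 5 hrs",
    "Usually responds in less than an hour",
    "Usually responds in more than a day",
    None,
])
hours = labels.map(response_hours)
print(hours.tolist())
```

Two consequences of these rules are worth keeping in mind during analysis: 'less than an hour' and '1 hour' both become 1, and a missing response time becomes 0, which cannot be told apart from a very fast response.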

Two problems remained at this point. First, some rows still contained 'NaN' entries. Second, the "No Of Lessons" column stored its numbers as text with commas as thousands separators. We handled them as follows:

1. Managing 'NaN' entries: We dropped every row that still had a missing value in any column.
2. Handling comma-separated numbers: We removed the commas from "No Of Lessons" so that the values can be converted to numbers.

In [8]:
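# Drop every row that still has a missing value in any column.
df2 = df2.dropna()
# Remove the commas so the values can later be parsed as numbers.
# (No NaN fill is needed here: dropna() has already removed the missing values.)
df2['No Of Lessons'] = df2['No Of Lessons'].astype(str).str.replace(',', '', regex=False)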
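
As a small illustration of the comma handling, the sketch below cleans a toy series. The values and the comma formatting are assumptions made for illustration; only the column name comes from the dataset.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical formatting): numbers stored as text, one with a comma.
lessons = pd.Series(["6473", "10,328", np.nan, "2"], name="No Of Lessons")

# Remove commas literally; missing entries stay missing.
cleaned = lessons.str.replace(",", "", regex=False)

# Drop the missing entry, then parse the remaining strings as numbers.
as_int = pd.to_numeric(cleaned.dropna())

print(as_int.tolist())
```

Stripping the commas before parsing matters because `pd.to_numeric` does not accept strings that contain thousands separators.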

To support numerical calculations and statistical analysis, we convert the following columns to numeric types:

1. 'No Of Lessons'
2. 'Response Time'
3. 'Reviews'

"Reviews" is additionally downcast to the smallest integer type that can hold its values.

In [9]:
df2['No Of Lessons'] = pd.to_numeric(df2['No Of Lessons'])
df2['Response Time'] = pd.to_numeric(df2['Response Time'])
df2['Reviews'] = pd.to_numeric(df2['Reviews'], downcast='integer')
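
The sketch below shows, on toy values, what `pd.to_numeric` does in the two situations relevant here: downcasting whole-number floats to a small integer type, and handling text that cannot be parsed. The series and variable names are illustrative assumptions; `errors="coerce"` is an optional safeguard that the cell above does not use.

```python
import pandas as pd

# Whole-number floats can be downcast to the smallest signed integer type that fits.
reviews = pd.to_numeric(pd.Series([65.0, 8.0, 117.0]), downcast="integer")
print(reviews.dtype)

# errors="coerce" turns unparseable text into NaN instead of raising an error.
mixed = pd.to_numeric(pd.Series(["5", "n/a"]), errors="coerce")
print(mixed.tolist())
```

With the default `errors="raise"`, a single unparseable value stops the conversion, which is a reasonable choice when the column is expected to be clean.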

During the exploratory data analysis (EDA) in step 3, we found an anomaly in the relationship between star ratings, number of reviews, and the overall popularity of teachers. To correct it, we returned to step 2 and added a popularity score based on the base-10 logarithm of the review count.

Specifically, we used the formula:

Popularity Score = Stars × log10(Reviews)

The score multiplies the star rating by the base-10 logarithm of the number of reviews. The logarithm compresses the review count: each tenfold increase in reviews adds the same amount to the factor, so a teacher with thousands of reviews does not dwarf every other teacher. A teacher with exactly one review receives a score of 0, because log10(1) = 0.

The result is a single metric that combines rating quality with review volume, which gives a more balanced picture of a tutor's popularity than either column alone.

In [10]:
df2['Popularity Score'] = (df2['Stars'] * np.log10( df2['Reviews']))
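
As a worked example, the sketch below applies the same formula to the stars and review counts of the first three tutors shown in the preview at the top of this notebook. The frame name `sample` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Stars and review counts taken from the first three rows of the preview.
sample = pd.DataFrame({
    "Name": ["James W.", "Taylor T.", "Desmond A."],
    "Stars": [4.9, 5.0, 4.8],
    "Reviews": [65, 8, 117],
})

# Popularity Score = Stars * log10(Reviews)
sample["Popularity Score"] = sample["Stars"] * np.log10(sample["Reviews"])

# Edge case: a single review gives a score of 0 regardless of the rating.
single_review_score = 5.0 * np.log10(1)

print(sample.round(2))
print(single_review_score)
```

In this sample the tutor with the lowest star rating gets the highest score because of the larger number of reviews, while the tutor with a perfect rating but only 8 reviews scores lowest.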

In [11]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7303 entries, 0 to 15151
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Name              7303 non-null   object 
 1   Country           7303 non-null   object 
 2   English Level     7303 non-null   object 
 3   Price             7303 non-null   float64
 4   Diploma           7303 non-null   object 
 5   Certificate       7303 non-null   object 
 6   Response Time     7303 non-null   int64  
 7   No Of Lessons     7303 non-null   int64  
 8   Stars             7303 non-null   float64
 9   Reviews           7303 non-null   int16  
 10  Popularity Score  7303 non-null   float64
dtypes: float64(3), int16(1), int64(2), object(5)
memory usage: 641.9+ KB


#### This is our dataset after data handling

In [12]:
# df2.to_csv('Data Frame (After Setp 2).csv', index=False)
# df2.to_excel('Data Frame (After Setp 2).xlsx', index=False)