In data science and machine learning, data cleaning is one of the most important tasks when working with large amounts of data. In this assignment, you'll see how to clean a dataset step by step using pandas. The dataset simulates a customer list.

**Q1 What is machine learning? Where and why would you use machine learning?**

Machine learning is the study of algorithms that improve a machine's performance at a task over time through experience.
<br>We use machine learning for the following reasons:
<br>1. Lack of Human Expertise: Machine learning is essential in scenarios where human expertise is limited or unavailable. For instance, when exploring unknown territories, such as uncharted regions of space or deep ocean exploration, machines can be used to make data-driven decisions in environments where human presence and knowledge are constrained or impractical.
<br>2. Difficulty in Translating Expertise into Computational Tasks: In some domains, humans possess valuable expertise, but translating that knowledge into computational tasks can be challenging. Machine learning can bridge this gap. For example, in the field of speech recognition, where humans have a deep understanding of spoken language, machine learning algorithms can be employed to convert this expertise into practical, computational tasks like voice commands and automated transcription, enabling more natural human-computer interactions.

**Q2 What is normalization/scaling in machine learning and why do you perform it? Explain with examples.**

Normalization is a common data preprocessing technique that rescales feature values to a common scale. Linear rescaling such as Min-Max scaling or Z-score standardization changes the range and location of a feature, but not the shape of its distribution. Normalization can improve the accuracy and efficiency of algorithms that rely on distance measurements, because it prevents attributes with large ranges from outweighing attributes with small ranges. The choice between Min-Max scaling and Z-score standardization depends on the characteristics of your data and the requirements of your specific machine learning task.
<br>Example:

In [30]:
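from sklearn import preprocessing
import numpy as np
from sklearn.datasets import fetch_california_housing

# Load the California housing dataset as a pandas DataFrame
california_housing = fetch_california_housing(as_frame=True)
# print(california_housing.DESCR)

# Extract the HouseAge feature as a NumPy array
x_array = np.array(california_housing.data['HouseAge'])
print("HouseAge array: ", x_array)

# normalize() expects 2D input; wrapping the array in a list treats it as a
# single sample (row), which is then scaled to unit L2 norm
normalized_arr = preprocessing.normalize([x_array])
print("Normalized HouseAge array: ", normalized_arr)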

HouseAge array:  [41. 21. 52. ... 17. 18. 16.]
Normalized HouseAge array:  [[0.00912272 0.00467261 0.01157028 ... 0.00378259 0.0040051  0.00356009]]


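The output shows that the `normalize()` function rescaled the array of median house age values so that the square root of the sum of their squares equals one. In other words, the whole array was scaled to unit length using the L2 norm. This differs from Min-Max scaling and Z-score standardization, which rescale a feature based on its range or on its mean and standard deviation.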
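
To make the Min-Max and Z-score options mentioned above concrete, here is a minimal sketch in plain Python. The helper names `min_max_scale` and `z_score_scale` are illustrative, not part of scikit-learn, and the sample contains only the six `HouseAge` values visible in the printed output, not the full dataset.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]

def z_score_scale(values):
    """Shift values to mean 0 and scale them to (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Only the six HouseAge values shown in the printed output above
house_age = [41.0, 21.0, 52.0, 17.0, 18.0, 16.0]

min_max_scaled = min_max_scale(house_age)
z_scaled = z_score_scale(house_age)

print("Min-Max scaled:", [round(v, 3) for v in min_max_scaled])
print("Z-score scaled:", [round(v, 3) for v in z_scaled])
```

Min-Max scaling keeps every value inside a fixed range, while Z-score standardization is usually less affected by a single extreme value stretching that range.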

**Q3 What are supervised and unsupervised learning? Give some examples.**

Supervised learning algorithms are the most commonly used ML algorithms. During training, such an algorithm takes the data samples (the training data) together with the output associated with each sample (its label or response). The main objective of supervised learning is to learn an association between input data samples and their corresponding outputs from many training instances.
<br>For example, we have
<br>x: Input variables and
<br>Y: Output variable
<br>Now, apply an algorithm to learn the mapping function from the input to output as follows:
<br>Y=f(x)
<br>Now, the main objective would be to approximate the mapping function so well that even when we have new input data (x), we can easily predict the output variable (Y) for that new input data.
<br>It is called supervised because the learning process can be thought of as being supervised by a teacher. Supervised learning models the relationship between the inputs and their corresponding outputs in the training data, so that we can predict outputs for new inputs based on the relationships and mappings learned earlier. This is why supervised learning methods are used extensively in predictive analytics, where the main objective is to predict a response for input data fed into a trained supervised ML model. Supervised learning methods fall into two major classes based on the type of ML task they aim to solve:
<br>•Classification
<br>•Regression
<br>Examples of supervised machine learning algorithms include Decision Tree, Random Forest, KNN, and Logistic Regression.
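
As a rough illustration of learning Y = f(x) from labeled data, the sketch below implements a tiny k-nearest-neighbours (KNN) classifier in plain Python. The function name `knn_predict`, the two-feature points, and the labels `"small"` and `"large"` are all made up for this example.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair every training point's distance to the query with its label
    distances = sorted(
        (math.dist(point, query), label) for point, label in zip(train_x, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Labeled training data: inputs (x) with their known outputs (Y)
train_x = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
train_y = ["small", "small", "small", "large", "large", "large"]

prediction_a = knn_predict(train_x, train_y, (1.2, 1.5))
prediction_b = knn_predict(train_x, train_y, (8.7, 8.2))
print(prediction_a, prediction_b)
```

The labels play the role of the "teacher": the prediction for a new input is derived entirely from examples whose correct output was known in advance.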
<br>
<br>
In unsupervised machine learning, there is no supervisor to provide guidance. Unsupervised learning algorithms are useful when, unlike in supervised learning, we do not have pre-labeled training data and we want to extract useful patterns from the input data.
<br>For example, suppose we have:
<br>x: Input variables. There is no corresponding output variable, so the algorithm must discover interesting patterns in the data on its own.
<br>Based on the ML task, unsupervised learning algorithms can be divided into the following broad classes:
<br>• Clustering
<br>• Association
<br>• Dimensionality Reduction
<br>• Anomaly detection
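
Clustering, the first class in the list above, can be sketched with a very small one-dimensional k-means loop. This is a simplified illustration, not a library implementation: the function name `k_means_1d`, the sample values, and the hand-picked starting centroids are assumptions made for the example.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group 1-D values around centroids by alternating assignment and update steps."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Unlabeled input data: there is no output variable to learn from
values = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
final_centroids, final_clusters = k_means_1d(values, centroids=[1.0, 8.0])
print(final_centroids)
print(final_clusters)
```

No labels are supplied at any point; the two groups emerge only from how close the values are to each other.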

**Q4 What is Data Cleaning and why do we need it?**  

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. 

We need data cleaning because clean data will ultimately increase overall productivity and allow for the highest quality information in your decision-making. Benefits include:

• Removal of errors when multiple sources of data are at play.
<br>• Fewer errors make for happier clients and less-frustrated employees.
<br>• Ability to map the different functions and what your data is intended to do.
<br>• Monitoring errors and better reporting to see where errors are coming from, making it easier to fix incorrect or corrupt data for future applications.
<br>• Using tools for data cleaning will make for more efficient business practices and quicker decision-making.

Data cleaning includes:
<br>• Converting data types where a variable's type does not match its contents.
<br>• Changing the format of date variables to the required format.
<br>• Replacing special characters and placeholder constants with appropriate values.
In [31]:
import pandas as pd

In [32]:
df_users = pd.DataFrame({
    "user_id": [234, 235, 236, 237, 237, 238, 239, 240, 241, 242, 242],
    "Name": ["Tom", "Alex--", "..Thomas", "John", "John", "Paul/", "Emma9", "Joy", "Samantha_", "Emily", "Emily"],
    "Last_name": ["Smith", "johnson", "brown", "Davis", "Davis", "None", "wilson", "Thompson", "Lee", "Johnson", "Johnson"],
    "age": [23, 32, 45, 22, 22, 50, 34, 47, 28, 19, 19],
    "Phone": ["555/123/4567", "333-234-5678", "444_456_7890", "111-222-3333", "111-222-3333", None, "333/987/4567", "222/345_987", "(777) 987-6543", "777-888-9999", "777-888-9999"],
    "Email": ["smith@email.com", "johnson@hotmail.com", "brown@email.com", "davis@mail.com", "davis@mail.com", "John@gmail.com", "wilson@mail.com", "thompson@email.com", "lee@email.com", "emily@hotmail.com", "emily@hotmail.com"],
    "Not_Useful_column": [None, None, None, None, None, None, None, None, None, None, None]
})

print(df_users)

    user_id       Name Last_name  age           Phone                Email  \
0       234        Tom     Smith   23    555/123/4567      smith@email.com   
1       235     Alex--   johnson   32    333-234-5678  johnson@hotmail.com   
2       236   ..Thomas     brown   45    444_456_7890      brown@email.com   
3       237       John     Davis   22    111-222-3333       davis@mail.com   
4       237       John     Davis   22    111-222-3333       davis@mail.com   
5       238      Paul/      None   50            None       John@gmail.com   
6       239      Emma9    wilson   34    333/987/4567      wilson@mail.com   
7       240        Joy  Thompson   47     222/345_987   thompson@email.com   
8       241  Samantha_       Lee   28  (777) 987-6543        lee@email.com   
9       242      Emily   Johnson   19    777-888-9999    emily@hotmail.com   
10      242      Emily   Johnson   19    777-888-9999    emily@hotmail.com   

   Not_Useful_column  
0               None  
1               None  
2               None  
3               None  
4               None  
5               None  
6               None  
7               None  
8               None  
9               None  
10              None  

Here we use the pandas `DataFrame()` constructor to create a mock dataset with 7 columns and 11 rows. The columns are `user_id` (the user's unique id), `Name`, `Last_name`, `age`, `Phone`, `Email`, and finally `Not_Useful_column`, which we will use as an example of how to delete an unnecessary column from a dataset.

As you can see, the example dataset has several inconsistencies: some rows are duplicated, the `Name` column contains unnecessary symbols, some values in the `Last_name` column are not capitalized, and the values in the `Phone` column use different formats, which makes them difficult to work with.
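
Before fixing anything, it can help to measure these problems. The sketch below runs a few checks on a five-row subset of the dataset; the `sample` frame and the variable names are illustrative, and the checks are one possible approach rather than a required step of the assignment.

```python
import pandas as pd

# Five rows copied from the mock dataset, restricted to three columns
sample = pd.DataFrame({
    "Name": ["Tom", "Alex--", "John", "John", "Joy"],
    "Last_name": ["Smith", "johnson", "Davis", "Davis", "Thompson"],
    "Phone": ["555/123/4567", "333-234-5678", "111-222-3333", "111-222-3333", "222/345_987"],
})

# Rows that repeat an earlier row exactly
duplicate_count = int(sample.duplicated().sum())

# Names containing anything other than letters
has_symbol = sample["Name"].str.contains(r"[^A-Za-z]", regex=True)
bad_names = sample.loc[has_symbol, "Name"].tolist()

# Last names that change when capitalized, so they were not capitalized before
not_capitalized = sample["Last_name"] != sample["Last_name"].str.capitalize()
uncapitalized = sample.loc[not_capitalized, "Last_name"].tolist()

# Phone numbers that do not contain exactly 10 digits
digit_counts = sample["Phone"].str.replace(r"\D", "", regex=True).str.len()
short_phones = sample.loc[digit_counts != 10, "Phone"].tolist()

print(duplicate_count, bad_names, uncapitalized, short_phones)
```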


Your final output should look like this:

In [38]:
df_users

   user_id      Name  Last_name  age       Phone                Email
0      234       Tom      Smith   23  5551234567      smith@email.com
1      235      Alex    Johnson   32  3332345678  johnson@hotmail.com
2      236    Thomas      Brown   45  4444567890      brown@email.com
3      237      John      Davis   22  1112223333       davis@mail.com
6      239      Emma     Wilson   34  3339874567      wilson@mail.com
8      241  Samantha        Lee   28  7779876543        lee@email.com
9      242     Emily    Johnson   19  7778889999    emily@hotmail.com


In [5]:
import pandas as pd
df_users = pd.DataFrame({
    "user_id": [234, 235, 236, 237, 237, 238, 239, 240, 241, 242, 242],
    "Name": ["Tom", "Alex--", "..Thomas", "John", "John", "Paul/", "Emma9", "Joy", "Samantha_", "Emily", "Emily"],
    "Last_name": ["Smith", "johnson", "brown", "Davis", "Davis", "None", "wilson", "Thompson", "Lee", "Johnson", "Johnson"],
    "age": [23, 32, 45, 22, 22, 50, 34, 47, 28, 19, 19],
    "Phone": ["555/123/4567", "333-234-5678", "444_456_7890", "111-222-3333", "111-222-3333", None, "333/987/4567", "222/345_987", "(777) 987-6543", "777-888-9999", "777-888-9999"],
    "Email": ["smith@email.com", "johnson@hotmail.com", "brown@email.com", "davis@mail.com", "davis@mail.com", "John@gmail.com", "wilson@mail.com", "thompson@email.com", "lee@email.com", "emily@hotmail.com", "emily@hotmail.com"],
    "Not_Useful_column": [None, None, None, None, None, None, None, None, None, None, None]
})

# Remove exact duplicate rows
df_users = df_users.drop_duplicates()

# Name column
# Remove symbols, digits, and any other non-alphabetic characters
df_users['Name'] = df_users['Name'].str.replace('[^a-zA-Z ]', '', regex=True)
# Strip leading and trailing spaces
df_users['Name'] = df_users['Name'].str.strip()

# Last_name column
# Capitalize the first letter of each last name
df_users['Last_name'] = df_users['Last_name'].str.capitalize()

# Phone column
# Remove every non-digit character
df_users['Phone'] = df_users['Phone'].str.replace('[^0-9]', '', regex=True)
# Keep only rows whose phone number has exactly 10 digits
# (rows with a missing phone number are dropped as well)
df_users = df_users[df_users['Phone'].str.len() == 10]

# Not_Useful_column
# Drop the column; reassigning avoids calling an in-place method on a filtered frame
df_users = df_users.drop(columns=['Not_Useful_column'])
df_users

   user_id      Name  Last_name  age       Phone                Email
0      234       Tom      Smith   23  5551234567      smith@email.com
1      235      Alex    Johnson   32  3332345678  johnson@hotmail.com
2      236    Thomas      Brown   45  4444567890      brown@email.com
3      237      John      Davis   22  1112223333       davis@mail.com
6      239      Emma     Wilson   34  3339874567      wilson@mail.com
8      241  Samantha        Lee   28  7779876543        lee@email.com
9      242     Emily    Johnson   19  7778889999    emily@hotmail.com
