In [None]:
# importing the libraries
import pandas as pd               # for data manipulation
import numpy as np                # for mathematical operations

In [None]:
# reading the raw dataset
df = pd.read_csv("Raw_Data.csv", index_col=0)

# displaying the dataset
df.head()

In [None]:
# displaying the info of the dataframe
df.info()

Many of the columns contain missing values. A few numerical columns, such as "Number of balls bowled" and "runs given", also have incorrect datatypes.
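
To see how a placeholder can cause an incorrect datatype, consider the minimal sketch below. It uses a made-up three-row table (the player labels and values are hypothetical, not taken from Raw_Data.csv). A single '-' in an otherwise numeric column makes pandas read the whole column as text, while declaring '-' as a missing-value marker lets pandas infer a numeric type.

```python
import io
import pandas as pd

# Hypothetical toy data, not taken from Raw_Data.csv
toy_text = "player,runs given\nA,120\nB,-\nC,85\n"

# read as-is: the '-' placeholder forces the column to be read as text
toy = pd.read_csv(io.StringIO(toy_text))
print(toy["runs given"].dtype)

# treat '-' as a missing value: the column is now inferred as numeric
toy_na = pd.read_csv(io.StringIO(toy_text), na_values=["-"])
print(toy_na["runs given"].dtype)
```

This is only an illustration of the cause; the notebook itself handles the placeholder after loading rather than through `na_values`.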

Before these issues are fixed, a few columns that are redundant or add no information to the dataset are removed.

In [None]:
# removing the columns
df.drop(['Variable Tag', 'Name', 'Unnamed: 25', 'matches played', 'innings played.1'], axis=1, inplace=True)

The unique values of every column are displayed to get a detailed look at the values stored in each of them.

In [None]:
# iterating over the column names
for col in df.columns:
    # printing the column name
    print(col)
    # printing the unique values in that specified column
    print(df[col].unique())

A few observations from the above output are:
1. Some values in 'highest score' end with an asterisk (*). This means that the batsman remained not out at the end of that innings. Since a column 'no' already stores the number of matches in which the batsman stayed not out, this column can be cleaned by removing the asterisk and converting it to a numeric datatype.
2. The symbol '-' is used along with 'nan' to represent null values in many of the features, like 'Number of balls bowled', 'runs given', etc. The '-' needs to be replaced with a null value that pandas recognizes (np.nan).
3. The columns 'Best innings bowling' and 'best match bowling' store impossible and incorrect dates and hence need to be removed.
4. The column '10 wicket hauls' stores only 0, '-' and nan, so it carries no useful information and can be removed.

All these issues are corrected in the next few steps.
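
As a minimal sketch of observations 1 and 2, the example below applies both fixes to a small made-up series (the scores are hypothetical, not taken from the dataset). It keeps the missing entry as NaN to show the intermediate state, whereas the notebook fills missing values with 0 in a later step.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: '*' marks a not-out innings, '-' marks a missing value
scores = pd.Series(["45*", "102", "-", "7*"], name="highest score")

cleaned = (
    scores.replace("-", np.nan)   # placeholder -> real missing value
          .str.rstrip("*")        # drop the trailing not-out marker
          .astype(float)          # float, so that NaN can be represented
)
print(cleaned.tolist())
```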

In [None]:
# replacing all the '-' with np.nan
df.replace('-', np.nan, inplace=True)

In [None]:
# no. of missing values in each of the columns
df.isnull().sum()

In [None]:
# filling all the missing values with 0
df.fillna(0, inplace=True)

In [None]:
# removing all the unnecessary columns
df.drop(['10 wicket hauls', 'best match bowling', 'Best innings bowling'], axis=1, inplace=True)

In [None]:
# unique values of the 'highest score' column
df['highest score'].unique()

In [None]:
# removing the asterisk and converting the datatype
# (cast to str first: entries filled with 0 above are integers, not strings)
df['highest score'] = df['highest score'].astype(str).str.strip('*').astype(int)

In [None]:
# unique values of the 'highest score' column
df['highest score'].unique()

In [None]:
# numerical variables which have incorrect datatype
var = ['Number of balls bowled', 'runs given', 'wkts taken', 'Bowling econ', 'sr', '4w', '5w']

In [None]:
# converting the datatype of the above variables
df[var] = df[var].astype('float')
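
Note that `astype('float')` raises an error if any value cannot be parsed as a number. As a possible alternative, not used in this notebook, `pd.to_numeric` with `errors='coerce'` turns such values into NaN instead. The sketch below uses made-up values to show the behaviour.

```python
import pandas as pd

# Hypothetical column mixing numeric strings, an unparseable entry and an int
raw = pd.Series(["24", "3.5", "n/a", 0])

# unparseable entries become NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")
print(converted.dtype)
print(converted.isna().sum())
```

Whether silent coercion is preferable to an error depends on the data: an error surfaces unexpected values immediately, while coercion hides them as missing values.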

In [None]:
# saving the cleaned dataset
df.to_csv("Clean_Data.csv", index=False)