<h2>Overview</h2>
<p>The <code>clean_and_preprocess_data</code> function cleans and preprocesses Google Play Store data stored in a CSV file. It takes a file path as input, reads the data, imputes missing ratings, converts string columns to numeric and datetime types, removes duplicate rows, and normalizes the column names.</p>

<h2>Parameters</h2>
<ul>
	<li><code>file_path</code>: The path to the CSV file containing the data.</li>
</ul>

<h2>Returns</h2>
<p>A pandas DataFrame containing the cleaned and preprocessed data.</p>

<h2>Operations Performed</h2>
<ol>
	<li><strong>Read Data</strong>: Reads the data from the specified CSV file using <code>pd.read_csv</code>.</li>
	<li><strong>Drop Unnecessary Columns</strong>: Drops the <code>Unnamed: 0</code> column, which is assumed to be unnecessary.</li>
	<li><strong>Handle Missing Values</strong>: Uses <code>KNNImputer</code> (with <code>n_neighbors=5</code>) to impute missing values in the <code>Rating</code> column, then drops any rows that still contain missing values in other columns.</li>
	<li><strong>Convert Columns to Numeric Values</strong>:
		<ul>
			<li>Converts the <code>Installs</code> column to numeric values by removing commas and plus signs.</li>
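			<li>Converts the <code>Size</code> column to numeric values by replacing "Varies with device" with 0 and removing "M" and "k" suffixes. The suffix is stripped without rescaling, so megabyte and kilobyte values end up on the same numeric scale.</li>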
			<li>Converts the <code>Price</code> column to numeric values by removing dollar signs.</li>
			<li>Converts the <code>Rating</code> column to numeric values.</li>
			<li>Converts the <code>Reviews</code> column to numeric values.</li>
		</ul>
	</li>
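	<li><strong>Convert Last Updated Column to Datetime Format</strong>: Converts the <code>Last Updated</code> column to datetime format using <code>pd.to_datetime</code> with <code>errors='coerce'</code>, so values that cannot be parsed become <code>NaT</code>.</li>
	<li><strong>Remove Duplicate Rows</strong>: Removes rows that are exact duplicates across all columns, keeping only the first occurrence.</li>
	<li><strong>Reset Index</strong>: Resets the index of the DataFrame.</li>
	<li><strong>Rename Columns</strong>: Converts column names to lowercase and replaces spaces with underscores (for example, <code>Last Updated</code> becomes <code>last_updated</code>).</li>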
</ol>

<h2>How to use:</h2>
<pre><code>file_path = '/content/googleplaystore(impure).csv'
data = clean_and_preprocess_data(file_path)
print(data.shape)</code></pre>




<h2>Notes</h2>
<ul>
  <li>This function assumes that the input CSV file has a specific structure and column names.</li>
  <li><code>KNNImputer</code> is used for the <code>Rating</code> column because more than 13% of its values are missing and the column is important for the analysis.</li>
  <li>The cleaning steps are tailored to this dataset and may need to be adjusted to the specific requirements of another project.</li>
</ul>

</body>
</html>
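
The string-to-number conversions described under Operations Performed can be tried in isolation before running the full function. The sketch below uses made-up sample values (the `raw` frame is hypothetical, not taken from the dataset) and passes `regex=False` so that `+` and `$`, which are regular-expression metacharacters, are treated as literal characters regardless of the pandas version in use.

```python
import pandas as pd

# Hypothetical sample values in the format the Installs and Price columns use
raw = pd.DataFrame({
    "Installs": ["10,000+", "500+", "1,000,000+"],
    "Price": ["0", "$2.99", "$0.99"],
})

# Strip thousands separators and the trailing plus sign, then cast to float
installs = (
    raw["Installs"]
    .str.replace(",", "", regex=False)
    .str.replace("+", "", regex=False)
    .astype(float)
)

# Strip the dollar sign, then cast to float
price = raw["Price"].str.replace("$", "", regex=False).astype(float)

print(installs.tolist())
print(price.tolist())
```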

In [111]:
import pandas as pd
import numpy as np
import datetime
from sklearn.impute import KNNImputer

def clean_and_preprocess_data(file_path):
    # Read the data from the file
    data = pd.read_csv(file_path, index_col=False)

    # Drop the unnecessary 'Unnamed: 0' column
    data.drop(columns='Unnamed: 0', inplace=True)

    # Impute missing ratings, then drop rows that still have missing values
    imputer = KNNImputer(n_neighbors=5)
    data[['Rating']] = imputer.fit_transform(data[['Rating']])
    data.dropna(inplace=True)

    # Convert 'Installs' column to numeric values ('+' is matched literally)
    data['Installs'] = data['Installs'].str.replace(',', '', regex=False).str.replace('+', '', regex=False).astype(float)

    # Convert 'Size' column to numeric values ('M' and 'k' are stripped without rescaling)
    data['Size'] = data['Size'].str.replace('Varies with device', '0')
    data['Size'] = data['Size'].str.replace('M', '').str.replace('k', '').astype(float)

    # Convert 'Price' column to numeric values ('$' is matched literally)
    data['Price'] = data['Price'].str.replace('$', '', regex=False).astype(float)

    # Convert 'rating' column to numeric values
    data['Rating'] = data['Rating'].astype(float)

    # Convert 'reviews' column to numeric values
    data['Reviews'] = data['Reviews'].astype(int)

    # Convert 'Last Updated' column to datetime (date only); unparseable values become NaT
    data['Last Updated'] = pd.to_datetime(data['Last Updated'], errors='coerce').dt.normalize()

    # Remove fully duplicated rows, keeping the first occurrence
    data.drop_duplicates(keep='first', inplace=True)

    # Reset the index of the DataFrame
    data.reset_index(drop=True, inplace=True)

    # Rename columns to lowercase with underscores instead of spaces
    data.columns = [col.lower().replace(' ', '_') for col in data.columns]

    return data


file_path = '/content/googleplaystore(impure).csv'
data = clean_and_preprocess_data(file_path)
print(data.shape)

(10346, 13)
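
Because the function strips the "M" and "k" suffixes from Size without rescaling, a 500k app and a 500M app receive the same number. If sizes need to be comparable, one possible alternative is sketched below. The helper name `size_to_megabytes`, the choice of 1024 kB per MB, and the use of NaN (rather than 0) for "Varies with device" are all assumptions for illustration and differ from what the function above does.

```python
import numpy as np
import pandas as pd

def size_to_megabytes(size: pd.Series) -> pd.Series:
    """Hypothetical helper: convert strings such as '19M' or '512k' to megabytes."""
    # Remove a trailing 'M' or 'k'; anything non-numeric (e.g. 'Varies with device') becomes NaN
    number = pd.to_numeric(size.str.rstrip("Mk"), errors="coerce")
    # Kilobyte values are scaled down; assumes 1 MB = 1024 kB
    scale = np.where(size.str.endswith("k", na=False), 1 / 1024, 1.0)
    return number * scale

# Made-up sample values in the format the Size column uses
sizes = pd.Series(["19M", "512k", "Varies with device"])
result = size_to_megabytes(sizes)
print(result.tolist())
```

Leaving unknown sizes as NaN keeps them out of averages, whereas 0 would pull them down; which is preferable depends on the analysis.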




In [116]:
#export cleaned data to csv file
data.to_csv('playstore_clean.csv', index=False)
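
CSV stores dates as text, so the datetime type of last_updated is not preserved by the export. When the file is read back, the column has to be parsed again. A minimal sketch, using an in-memory buffer and a made-up row in place of playstore_clean.csv:

```python
import io
import pandas as pd

# Hypothetical one-row stand-in for the cleaned DataFrame
cleaned = pd.DataFrame({
    "app": ["Example App"],
    "last_updated": pd.to_datetime(["2018-06-26"]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Reading without parse_dates leaves the column as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading with parse_dates restores the datetime type
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["last_updated"])

print(plain["last_updated"].dtype)
print(parsed["last_updated"].dtype)
```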


In [115]:
data.sample(5)

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver
5267,DEER HUNTER 2018,GAME,4.3,955614,82.0,10000000.0,Free,0.0,Teen,Action,2018-06-26,5.1.2,3.0 and up
1228,Calorie Counter by FatSecret,HEALTH_AND_FITNESS,4.4,229210,0.0,10000000.0,Free,0.0,Everyone,Health & Fitness,2018-07-31,Varies with device,Varies with device
5404,Gangster Town: Vice District,FAMILY,4.3,65146,100.0,10000000.0,Free,0.0,Mature 17+,Simulation,2018-05-31,2.1,4.0 and up
6857,Weather Data CH,WEATHER,4.193338,15,0.0,500.0,Paid,2.99,Everyone,Weather,2016-08-09,Varies with device,Varies with device
8669,Peggle Blast,GAME,4.1,166251,19.0,5000000.0,Free,0.0,Everyone,Card,2017-12-11,2.16.0,4.0.3 and up
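
The rating 4.193338 in the sample above has more decimal places than the other ratings, which suggests it is an imputed value. KNNImputer measures distance between rows using the other features that are present; when it is given the Rating column alone there is nothing to measure distance on, and scikit-learn falls back to the column mean for each missing entry. The sketch below illustrates the difference with made-up numbers. The second feature, `log_reviews`, is an assumption for illustration and is not used by the function above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up data: row 2 is missing its rating
toy = pd.DataFrame({
    "Rating": [4.0, 5.0, np.nan, 3.0],
    "log_reviews": [1.0, 1.1, 1.05, 9.0],
})

# Rating alone: no shared feature to compare rows on, so the gap gets the column mean
alone = KNNImputer(n_neighbors=2).fit_transform(toy[["Rating"]])

# Rating plus a second feature: the gap is averaged from the two rows closest in log_reviews
paired = KNNImputer(n_neighbors=2).fit_transform(toy)

print(alone[2, 0])
print(paired[2, 0])
```

With a single column the result is the mean of the observed ratings; with the extra feature it is the average of the two nearest rows.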
