<h2>Overview</h2>
<p>The <code>clean_and_preprocess_data</code> function cleans and preprocesses the Google Play Store app data stored in a CSV file. It takes a file path, reads the file into a pandas DataFrame, imputes missing ratings, converts the text columns that hold numbers and dates to numeric and datetime types, and removes duplicate rows.</p>

<h2>Parameters</h2>
<ul>
	<li><code>file_path</code>: The path to the CSV file containing the data.</li>
</ul>

<h2>Returns</h2>
<p>A pandas DataFrame containing the cleaned and preprocessed data.</p>

<h2>Operations Performed</h2>
<ol>
	<li><strong>Read Data</strong>: Reads the data from the specified CSV file using <code>pd.read_csv</code>.</li>
	<li><strong>Drop Unnecessary Columns</strong>: Drops the <code>Unnamed: 0</code> column, which appears to be a leftover index from an earlier export and is not needed.</li>
	<li><strong>Handle Missing Values</strong>: Uses <code>KNNImputer</code> to impute missing values in the <code>Rating</code> column, then drops any rows that still contain missing values in other columns.</li>
	<li><strong>Convert Columns to Numeric Values</strong>:
		<ul>
			<li>Converts the <code>Installs</code> column to floats by removing commas and plus signs.</li>
			<li>Converts the <code>Size</code> column to floats by replacing "Varies with device" with 0 and stripping the "M" and "k" suffixes. The numbers are not rescaled, so a size that was given in kilobytes keeps its kilobyte magnitude.</li>
			<li>Converts the <code>Price</code> column to floats by removing dollar signs.</li>
			<li>Converts the <code>Rating</code> column to floats.</li>
			<li>Converts the <code>Reviews</code> column to integers.</li>
		</ul>
	</li>
	<li><strong>Convert Last Updated Column to Datetime Format</strong>: Converts the <code>Last Updated</code> column to datetime format using <code>pd.to_datetime</code> with <code>errors='coerce'</code>, so values that cannot be parsed become <code>NaT</code>.</li>
	<li><strong>Remove Duplicate Rows</strong>: Removes rows that are exact duplicates across all columns, keeping only the first occurrence.</li>
	<li><strong>Reset Index</strong>: Resets the index of the DataFrame.</li>
</ol>
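<p>As a rough illustration of step 4, the string-to-number conversions can be sketched on a tiny hand-made frame. The sample values are invented, and <code>size_to_megabytes</code> is a hypothetical helper that is not part of the function: it shows one possible way to put every size on the same scale, assuming that 1 M equals 1024 k. Passing <code>regex=False</code> makes sure "+" and "$" are treated as literal characters.</p>

```python
import pandas as pd

# Invented sample rows that mimic the raw column formats.
raw = pd.DataFrame({
    "Installs": ["10,000+", "5,000,000+", "100+"],
    "Size": ["19M", "201k", "Varies with device"],
    "Price": ["0", "$4.99", "$0.99"],
})


def size_to_megabytes(value):
    """Hypothetical helper: convert '19M' / '201k' strings to megabytes."""
    if value == "Varies with device":
        return 0.0  # same placeholder value the function uses
    if value.endswith("M"):
        return float(value[:-1])
    if value.endswith("k"):
        return float(value[:-1]) / 1024  # assumption: 1 M = 1024 k
    return float(value)


clean = pd.DataFrame({
    # Strip thousands separators and the trailing plus sign.
    "Installs": raw["Installs"]
    .str.replace(",", "", regex=False)
    .str.replace("+", "", regex=False)
    .astype(float),
    # Unit-aware alternative to simply stripping the suffix.
    "Size": raw["Size"].map(size_to_megabytes),
    # Strip the dollar sign.
    "Price": raw["Price"].str.replace("$", "", regex=False).astype(float),
})

print(clean)
```

<p>Whether the kilobyte values should be rescaled at all depends on how <code>Size</code> is used later in the analysis.</p>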

<h2>How to Use</h2>
<pre><code>file_path = '/content/googleplaystore(impure).csv'
data = clean_and_preprocess_data(file_path)
print(data.shape)</code></pre>


<h2>Notes</h2>
<ul>
  <li>The function assumes that the input CSV file contains the columns it references (<code>Unnamed: 0</code>, <code>Rating</code>, <code>Installs</code>, <code>Size</code>, <code>Price</code>, <code>Reviews</code>, and <code>Last Updated</code>); a missing column raises an error.</li>
  <li><code>KNNImputer</code> fills the missing values in <code>Rating</code> instead of dropping those rows, because more than 13% of the ratings are missing and the column is important for the analysis.</li>
  <li>The cleaning steps are specific to this dataset and may need to be adjusted to fit the requirements of another project.</li>
</ul>
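<p>One detail worth checking about the imputation: when the imputer only sees the <code>Rating</code> column, there are no other features from which to measure distances between rows, and scikit-learn's <code>KNNImputer</code> then falls back to the column mean. The small sketch below uses invented values, not the project data, to show that behavior.</p>

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Invented ratings with one missing value.
ratings = pd.DataFrame({"Rating": [4.0, np.nan, 5.0, 3.0]})

# Same call pattern as in the function: the imputer is fitted on one column only.
imputer = KNNImputer(n_neighbors=5)
imputed = imputer.fit_transform(ratings[["Rating"]])

filled_value = imputed[1, 0]             # value written into the missing slot
column_mean = ratings["Rating"].mean()   # mean of the observed ratings

print(filled_value, column_mean)
```

<p>For the neighbors to influence the result, other numeric columns would have to be passed to the imputer together with <code>Rating</code>; whether that is worthwhile here is a judgment call for the project.</p>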

</body>
</html>

In [106]:
import pandas as pd
import numpy as np
import datetime
from sklearn.impute import KNNImputer

def clean_and_preprocess_data(file_path):
    # Read the data from the file
    data = pd.read_csv(file_path, index_col=False)

    # Drop the leftover index column
    data.drop(columns='Unnamed: 0', inplace=True)

    # Impute missing ratings, then drop rows that still contain missing values
    imputer = KNNImputer(n_neighbors=5)
    data[['Rating']] = imputer.fit_transform(data[['Rating']])
    data.dropna(inplace=True)

    # Convert 'Installs' to numeric values; regex=False treats '+' as a literal character
    data['Installs'] = (data['Installs'].str.replace(',', '', regex=False)
                                        .str.replace('+', '', regex=False)
                                        .astype(float))

    # Convert 'Size' to numeric values (the 'M' and 'k' suffixes are stripped, not rescaled)
    data['Size'] = data['Size'].str.replace('Varies with device', '0', regex=False)
    data['Size'] = (data['Size'].str.replace('M', '', regex=False)
                                .str.replace('k', '', regex=False)
                                .astype(float))

    # Convert 'Price' to numeric values; regex=False treats '$' as a literal character
    data['Price'] = data['Price'].str.replace('$', '', regex=False).astype(float)

    # Convert 'Rating' to numeric values
    data['Rating'] = data['Rating'].astype(float)

    # Convert 'Reviews' to numeric values
    data['Reviews'] = data['Reviews'].astype(int)

    # Convert 'Last Updated' to datetime; values that cannot be parsed become NaT
    data['Last Updated'] = pd.to_datetime(data['Last Updated'], errors='coerce')

    # Remove rows that are exact duplicates, keeping the first occurrence
    data.drop_duplicates(keep='first', inplace=True)

    # Reset the index of the DataFrame
    data.reset_index(drop=True, inplace=True)

    return data


file_path = '/content/googleplaystore(impure).csv'
data = clean_and_preprocess_data(file_path)
print(data.shape)

(10346, 13)




In [96]:
#export cleaned data to csv file
data.to_csv('playstore_clean.csv', index=False)
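
When the exported file is read back later, the Last Updated column comes back as plain text unless pandas is told to parse it. A minimal sketch of that round trip, using an invented two-row frame and a temporary file (the file name and values are placeholders, not the project data):

```python
import os
import tempfile

import pandas as pd

# Invented sample standing in for the cleaned DataFrame.
sample = pd.DataFrame({
    "App": ["Example A", "Example B"],
    "Last Updated": pd.to_datetime(["2018-01-07", "2018-06-20"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "playstore_clean_sample.csv")
    sample.to_csv(path, index=False)

    # Without parse_dates the dates are loaded as text.
    as_text = pd.read_csv(path)
    # With parse_dates the column is restored to a datetime dtype.
    as_dates = pd.read_csv(path, parse_dates=["Last Updated"])

print(as_text["Last Updated"].dtype)
print(as_dates["Last Updated"].dtype)
```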


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.100000,159,19.0,10000.0,Free,0.0,Everyone,Art & Design,2018-01-07,1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.900000,967,14.0,500000.0,Free,0.0,Everyone,Art & Design;Pretend Play,2018-01-15,2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.700000,87510,8.7,5000000.0,Free,0.0,Everyone,Art & Design,2018-08-01,1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.500000,215644,25.0,50000000.0,Free,0.0,Teen,Art & Design,2018-06-08,Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.300000,967,2.8,100000.0,Free,0.0,Everyone,Art & Design;Creativity,2018-06-20,1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10341,Sya9a Maroc - FR,FAMILY,4.500000,38,53.0,5000.0,Free,0.0,Everyone,Education,2017-07-25,1.48,4.1 and up
10342,Fr. Mike Schmitz Audio Teachings,FAMILY,5.000000,4,3.6,100.0,Free,0.0,Everyone,Education,2018-07-06,1,4.1 and up
10343,Parkinson Exercices FR,MEDICAL,4.193338,3,9.5,1000.0,Free,0.0,Everyone,Medical,2017-01-20,1,2.2 and up
10344,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.500000,114,0.0,1000.0,Free,0.0,Mature 17+,Books & Reference,2015-01-19,Varies with device,Varies with device


In [97]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10346 entries, 0 to 10345
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   App             10346 non-null  object        
 1   Category        10346 non-null  object        
 2   Rating          10346 non-null  float64       
 3   Reviews         10346 non-null  int64         
 4   Size            10346 non-null  float64       
 5   Installs        10346 non-null  float64       
 6   Type            10346 non-null  object        
 7   Price           10346 non-null  float64       
 8   Content Rating  10346 non-null  object        
 9   Genres          10346 non-null  object        
 10  Last Updated    10346 non-null  datetime64[ns]
 11  Current Ver     10346 non-null  object        
 12  Android Ver     10346 non-null  object        
dtypes: datetime64[ns](1), float64(4), int64(1), object(7)
memory usage: 1.0+ MB
