# UC San Diego: Data Science in Practice - Data Checkpoint
### Summer Session I 2023 | Instructor : C. Alex Simpkins Ph.D.

## Draft project title if you have one (can be changed later)

# Names

- Zhirui Xia
- Zehan Li
- Yue Yin
- Xiaojie Chen
- Chenri Luo

<a id='research_question'></a>
# Research Question

Do developer reputation, price, and update frequency (calculated from the `Released` and `Updated` dates) together affect the Current Version Score of an app? This question aims to see how the developer's reputation, the price of the app, and its update frequency jointly influence the score of the app's current version.


# Dataset(s)

- Dataset Name: Apple App Data
- Link to the dataset: https://www.kaggle.com/datasets/gauthamp10/apple-appstore-apps
- Number of observations: 129453

This dataset contains information about applications on the Apple App Store. 

The original dataset contains **1.2 Million+** apps, but for the purpose of our project we have reduced it to **129465** observations, with each row representing a unique app. We work with the following **15** columns:


1. App_Id: A unique identifier for each mobile application in the dataset.

2. App_Name: The name of the mobile application.

3. Primary_Genre: The primary genre or category of the mobile application, such as Business, Education, Book, Games, etc.

4. Content_Rating: The content rating assigned to the app, indicating the age group or audience suitability (e.g., 4+, 17+).

5. Size_Bytes: The size of the mobile application in bytes.

6. Required_IOS_Version: The minimum version of the iOS operating system required to run the app.

7. Released: The date when the app was initially released on the app store.

8. Updated: The date when the app was last updated on the app store.

9. Version: The version number of the app.

10. Price: The price of the app, in dollars.

11. Free: A boolean indicator (True/False) denoting whether the app is free or not.

12. DeveloperId: A unique identifier for the app developer.

13. Developer: The name of the app developer or development company.

14. Average_User_Rating: The average user rating given to the app by users who have rated it.

15. Current_Version_Score: The current version score of the app for the latest version available on the app store.

This dataset supports **user ratings analysis** and the investigation of factors that may correlate with ratings. By exploring how average user ratings are distributed across the other variables, we can better understand which factors matter to app developers for achieving high user satisfaction.

# Data Wrangling

1. Download the dataset:

   We have uploaded our dataset to our group GitHub repository: https://github.com/drsimpkins-teaching/cogs108_ss1_23_group_13/blob/main/app.csv
   
   The first step is to download the CSV file and store it in the current working directory. 
<br>

2. Import necessary libraries: 
   
   The next step is to import the required libraries. 
   
   The **numpy** and **pandas** libraries are imported under the aliases `np` and `pd`, respectively.  
<br>

3. Read the data from the CSV file: 
   
   Use the `pd.read_csv()` function from pandas to read the data from the CSV file (app.csv) into a DataFrame. 
   
   **NOTE**: The CSV file (app.csv) should be located in the current working directory.
<br>

4. Add two columns to put the dataset in a more usable form

    Because not all the information we need for our research question is included in the raw dataset, we added two columns, `Reputation` and `Update_Frequency`, for easier analysis. 

    (1). Adding the `Reputation` column
    
    One variable in our question is the developer's reputation, which is not included in the raw dataset, so we generate a reputation score ourselves from the given information. 
    
    Our idea is to group the apps by `DeveloperId` and calculate the mean `Average_User_Rating` for each unique developer. We then assign this mean rating, which we regard as the **developer reputation**, back to the data frame by mapping it onto each row's `DeveloperId`. 
     
    (2). Adding the `Update_Frequency` column
   
    We also want to investigate the update frequency of each app. To create this column, we first convert the `Released` and `Updated` columns to pandas datetime objects. We then calculate the frequency as the ratio of days since release to days since the last update, both measured from the date the code is run. 
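To make step 4(1) concrete, here is a minimal sketch of the group-then-map idea on a toy table. The column names follow the dataset, but the values and the names `toy`, `mean_by_dev`, and `reputation_alt` are made up for illustration.

```python
import pandas as pd

# Toy data: column names follow the dataset, values are illustrative only.
toy = pd.DataFrame({
    "DeveloperId": [1, 1, 2, 3, 3, 3],
    "Average_User_Rating": [4.0, 5.0, 3.0, 0.0, 2.0, 4.0],
})

# Mean rating per developer, indexed by DeveloperId.
mean_by_dev = toy.groupby("DeveloperId")["Average_User_Rating"].mean()

# Look up each row's DeveloperId in that per-developer mean.
toy["Reputation"] = toy["DeveloperId"].map(mean_by_dev)

# Equivalent one-step form: transform broadcasts the group mean back to every row.
reputation_alt = toy.groupby("DeveloperId")["Average_User_Rating"].transform("mean")

print(toy)
```

Note that with this definition an app's own rating is part of its developer's mean, so a developer with a single app gets a reputation equal to that app's rating.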

In [5]:
import numpy as np
import pandas as pd
# import seaborn as sns

app = pd.read_csv("app.csv")

## Create Reputation column
newValue = app.groupby(['DeveloperId'])['Average_User_Rating'].mean()
app['Reputation'] = app['DeveloperId'].map(newValue)

## Create Update_Frequency column
# Parse 'Released' and 'Updated' as timezone-aware UTC timestamps
# (utc=True also works if the raw strings carry no time zone)
app['Released'] = pd.to_datetime(app['Released'], utc=True)
app['Updated'] = pd.to_datetime(app['Updated'], utc=True)

# Get the current date with the same time zone as the other Timestamps
current_date = pd.Timestamp.now(tz='UTC')

# Perform the calculations
days_since_released = (current_date - app['Released']).dt.total_seconds() / (60 * 60 * 24)
days_since_updated = (current_date - app['Updated']).dt.total_seconds() / (60 * 60 * 24)

# Create Update_Frequency column
app['Update_Frequency'] = days_since_released / days_since_updated
app.head()

| | App_Id | App_Name | AppStore_Url | Primary_Genre | Content_Rating | Size_Bytes | Required_IOS_Version | Released | Updated | Version | ... | DeveloperId | Developer | Developer_Url | Developer_Website | Average_User_Rating | Reviews | Current_Version_Score | Current_Version_Reviews | Reputation | Update_Frequency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | com.hkbu.arc.apaper | A+ Paper Guide | https://apps.apple.com/us/app/a-paper-guide/id... | Education | 4+ | 21993472.0 | 8.0 | 2017-09-28 03:02:41+00:00 | 2018-12-21 21:30:36+00:00 | 1.1.2 | ... | 1375410542 | HKBU ARC | https://apps.apple.com/us/developer/hkbu-arc/i... | | 0.0 | 0 | 0.0 | 0 | 1.166666 | 1.268779 |
| 1 | com.dmitriev.abooks | A-Books | https://apps.apple.com/us/app/a-books/id103157... | Book | 4+ | 13135872.0 | 10.0 | 2015-08-31 19:31:32+00:00 | 2019-07-23 20:31:09+00:00 | 1.3 | ... | 1031572001 | Roman Dmitriev | https://apps.apple.com/us/developer/roman-dmit... | | 5.0 | 1 | 5.0 | 1 | 5.0 | 1.974387 |
| 2 | no.terp.abooks | A-books | https://apps.apple.com/us/app/a-books/id145702... | Book | 4+ | 21943296.0 | 9.0 | 2021-04-14 07:00:00+00:00 | 2021-05-30 21:08:54+00:00 | 1.3.1 | ... | 1457024163 | Terp AS | https://apps.apple.com/us/developer/terp-as/id... | | 0.0 | 0 | 0.0 | 0 | 0.0 | 1.059547 |
| 3 | fr.antoinettefleur.Book1 | A-F Book #1 | https://apps.apple.com/us/app/a-f-book-1/id500... | Book | 4+ | 81851392.0 | 8.0 | 2012-02-10 03:40:07+00:00 | 2019-10-29 12:40:37+00:00 | 1.2 | ... | 439568839 | i-editeur.com | https://apps.apple.com/us/developer/i-editeur-... | | 0.0 | 0 | 0.0 | 0 | 0.0 | 3.069675 |
| 4 | com.imonstersoft.azdictionaryios | A-Z Synonyms Dictionary | https://apps.apple.com/us/app/a-z-synonyms-dic... | Reference | 4+ | 64692224.0 | 9.0 | 2020-12-16 08:00:00+00:00 | 2020-12-18 21:36:11+00:00 | 1.0.1 | ... | 656731821 | Ngov chiheang | https://apps.apple.com/us/developer/ngov-chihe... | http://imonstersoft.com | 0.0 | 0 | 0.0 | 0 | 1.12705 | 1.002715 |
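The following sketch works through the `Update_Frequency` ratio on two made-up rows. It assumes a fixed `reference_date` (a hypothetical name) in place of `pd.Timestamp.now()`, so the numbers are reproducible; in the cell above the reference point is the time of execution, so the values shift slightly each time the cell is re-run.

```python
import pandas as pd

# Toy data: column names follow the dataset; the two rows are illustrative.
toy = pd.DataFrame({
    "Released": ["2020-01-01T00:00:00Z", "2020-01-01T00:00:00Z"],
    "Updated": ["2020-01-01T00:00:00Z", "2020-12-31T00:00:00Z"],
})

# utc=True parses the strings into timezone-aware UTC timestamps.
toy["Released"] = pd.to_datetime(toy["Released"], utc=True)
toy["Updated"] = pd.to_datetime(toy["Updated"], utc=True)

# Assumption: a fixed reference date instead of the current time.
reference_date = pd.Timestamp("2021-01-10", tz="UTC")

# Dividing a timedelta Series by a one-day Timedelta yields float days.
one_day = pd.Timedelta(days=1)
days_since_released = (reference_date - toy["Released"]) / one_day
days_since_updated = (reference_date - toy["Updated"]) / one_day

toy["Update_Frequency"] = days_since_released / days_since_updated
print(toy["Update_Frequency"].tolist())  # [1.0, 37.5]
```

Under this definition an app that has never been updated since release scores exactly 1, and the score grows as the last update gets closer to the reference date. The metric therefore reflects how recent the last update is relative to the app's age, not a count of updates.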


# Data Cleaning

For our analysis, we keep fields such as the app ID, app name, genre, content rating, size, required iOS version, release date, update date, price, developer ID, and average user rating, among other details.

We excluded columns that are not pertinent to our research question: the App Store URL (`AppStore_Url`), the price currency (`Currency`), the developer's App Store page (`Developer_Url`), the developer's website (`Developer_Website`), and the review counts (`Reviews` and `Current_Version_Reviews`).

Next, we removed any rows in the remaining dataset that contained missing values.

This process produced our refined dataset of 129453 rows and 17 columns.

In [3]:
# Drop columns that are not needed for the analysis
app = app.drop(columns=['AppStore_Url','Currency','Developer_Url', 'Developer_Website','Reviews',
                        'Current_Version_Reviews'])
# Drop rows that contain any missing value
app = app.dropna(how='any')
app.head()

| | App_Id | App_Name | Primary_Genre | Content_Rating | Size_Bytes | Required_IOS_Version | Released | Updated | Version | Price | Free | DeveloperId | Developer | Average_User_Rating | Current_Version_Score | Reputation | Update_Frequency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | com.hkbu.arc.apaper | A+ Paper Guide | Education | 4+ | 21993472.0 | 8.0 | 2017-09-28 03:02:41+00:00 | 2018-12-21 21:30:36+00:00 | 1.1.2 | 0.0 | True | 1375410542 | HKBU ARC | 0.0 | 0.0 | 1.166666 | 1.268781 |
| 1 | com.dmitriev.abooks | A-Books | Book | 4+ | 13135872.0 | 10.0 | 2015-08-31 19:31:32+00:00 | 2019-07-23 20:31:09+00:00 | 1.3 | 0.0 | True | 1031572001 | Roman Dmitriev | 5.0 | 5.0 | 5.0 | 1.974394 |
| 2 | no.terp.abooks | A-books | Book | 4+ | 21943296.0 | 9.0 | 2021-04-14 07:00:00+00:00 | 2021-05-30 21:08:54+00:00 | 1.3.1 | 0.0 | True | 1457024163 | Terp AS | 0.0 | 0.0 | 0.0 | 1.059548 |
| 3 | fr.antoinettefleur.Book1 | A-F Book #1 | Book | 4+ | 81851392.0 | 8.0 | 2012-02-10 03:40:07+00:00 | 2019-10-29 12:40:37+00:00 | 1.2 | 2.99 | False | 439568839 | i-editeur.com | 0.0 | 0.0 | 0.0 | 3.06969 |
| 4 | com.imonstersoft.azdictionaryios | A-Z Synonyms Dictionary | Reference | 4+ | 64692224.0 | 9.0 | 2020-12-16 08:00:00+00:00 | 2020-12-18 21:36:11+00:00 | 1.0.1 | 0.0 | True | 656731821 | Ngov chiheang | 0.0 | 0.0 | 1.12705 | 1.002715 |


In [4]:
app.shape

(129453, 17)
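
As a sanity check on the cleaning step, one could count missing values per column and the number of rows that `dropna` removes. The sketch below does this on a made-up table; the values and the names `toy`, `kept`, and `cleaned` are illustrative, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data with missing values; values are illustrative only.
toy = pd.DataFrame({
    "App_Id": ["a", "b", "c", "d"],
    "Price": [0.0, 2.99, np.nan, 0.0],
    "Developer_Website": [np.nan, "http://example.com", np.nan, np.nan],
    "Average_User_Rating": [4.5, np.nan, 3.0, 5.0],
})

# Drop the unused column first, so its missing values do not remove rows.
kept = toy.drop(columns=["Developer_Website"])

# Number of missing values in each remaining column.
missing_per_column = kept.isna().sum()

# Remove every row that still has at least one missing value.
cleaned = kept.dropna(how="any")
rows_removed = len(kept) - len(cleaned)

# For comparison: calling dropna before dropping the sparse column.
rows_if_dropna_first = len(toy.dropna(how="any"))

print(missing_per_column)
print(f"rows removed: {rows_removed}, rows left if dropna ran first: {rows_if_dropna_first}")
```

The order matters: a mostly empty column such as `Developer_Website` would remove most rows if `dropna(how='any')` ran before the column was dropped, which is why the cleaning cell drops the unused columns first.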