# UC San Diego: Data Science in Practice
## Final Project Title (change this to your project's title)

## Permissions

Place an `X` in the appropriate bracket below to specify if you would like your group's project to be made available to the public. (Note that student names will be included, but PIDs will be scraped from any groups who include their PIDs.)

* [  ] YES - make available
* [X] NO - keep private

# Names

- Zhirui Xia
- Zehan Li
- Yue Yin
- Xiaojie Chen
- Chenri Luo


# Overview

* Write a clear, 3-4 sentence summary of what you did and why.

<a id='research_question'></a>
# Research Question

Do `Genre`, `Price`, and `Update Frequency` have a statistically significant influence, individually and together, on the `Version Score` of an app? Can these variables be used to predict the version score of an app?

<a id='background'></a>

## Background & Prior Work

Mobile applications have become an essential part of daily life, catering to various user needs ranging from entertainment to productivity. Both the genre and the content rating of an app are crucial attributes that can determine its reception among users, which can be measured by the number of reviews and user ratings it receives[^Khalid2015].

Several studies have analyzed aspects of mobile applications and their performance in the market. Mo et al. (2017) examined the relationship between the number of downloads, ratings, and the category of Android apps. They discovered that different app categories had varied ratings, and that this variance significantly influenced the number of downloads[^Mo2017].

In a different study, Fu et al. (2013) analyzed the impact of age ratings on the success of iOS games, and they found that games rated for older age groups had fewer downloads but higher revenues[^Fu2013]. While these studies have looked at genre or content rating separately, our research will extend this line of inquiry by investigating how these two factors together influence an app's reviews and ratings.

Other relevant research includes a study by Khalid et al. (2015), where they investigated the common reasons behind negative app reviews. They found that bugs, app functionality, and customer service were significant factors that led to lower ratings[^Khalid2015]. This background could provide valuable context in understanding the dynamics of user reviews and ratings.

[^Khalid2015]: Khalid, H., Shihab, E., Nagappan, M., & Hassan, A. E. (2015). What Do Mobile App Users Complain About? A Study on Free iOS Apps. IEEE Software, 32(3), 70-77.

[^Mo2017]: Mo, K., Tan, C., & Lu, S. (2017). Watch Out for This! A Study of the Factors that Influence Mobile Application Downloads. 2017 13th International Conference on Computational Intelligence and Security (CIS).

[^Fu2013]: Fu, F., Zhang, L., & Chan, K. (2013). A multilevel model of free-to-play games: An empirical study of a game's lifetime and revenue. Expert Systems with Applications, 40(8), 3166-3173.

# Hypothesis


Research Hypothesis: 

There is a significant relationship between an app's `genre`, `price`, and `update frequency` (calculated from its release and update dates, `Released` and `Updated`) and its `current version score` in the Apple App Store.

Null hypothesis:

There is no significant relationship between the genre, price, and update frequency of an app and the `Current_Version_Score` of apps in the Apple App Store.
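The hypothesis relies on an update frequency derived from the release and update dates, but the exact formula is not stated here. One possible proxy, sketched below under that assumption, is the number of whole days between `Released` and `Updated`. The helper `add_update_interval` and the column name `Update_Interval_Days` are hypothetical, and the two timestamp pairs are copied from rows shown in the data preview.

```python
import pandas as pd


def add_update_interval(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the whole days between release and last update.

    Assumption: this interval is only a proxy for update frequency; a true
    frequency would also need the number of updates per app.
    """
    out = df.copy()
    released = pd.to_datetime(out["Released"], utc=True)
    updated = pd.to_datetime(out["Updated"], utc=True)
    out["Update_Interval_Days"] = (updated - released).dt.days
    return out


# Two timestamp pairs taken from the preview rows of the dataset.
toy = pd.DataFrame({
    "Released": ["2015-08-31T19:31:32Z", "2020-12-16T08:00:00Z"],
    "Updated": ["2019-07-23T20:31:09Z", "2020-12-18T21:36:11Z"],
})
result = add_update_interval(toy)
print(result[["Released", "Updated", "Update_Interval_Days"]])
```

A long interval can mean either a long-lived, actively maintained app or an old app updated once, so this proxy would need to be interpreted with care.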

# Dataset(s)

- **Dataset Name**: Apple App Data
- **Link to the dataset**: https://www.kaggle.com/datasets/gauthamp10/apple-appstore-apps
- **Number of observations**: 129465 × 21
- **Description of the dataset**: 

This dataset contains information about applications on the Apple App Store. 

The full dataset has **21** columns and more than **1.2 million** app records, but for the purpose of our project we reduced it to **129465 × 21** (rows × columns), with each row representing a unique app. The columns contain the following information:

This dataset supports **user ratings analysis** and the investigation of potential correlates of those ratings. By exploring how average user ratings are distributed across other variables, it can help app developers understand which factors matter for achieving high user satisfaction.

# Ethics & Privacy

Ethics and privacy are major concerns in this project. The proposed data potentially includes sensitive information such as user reviews and ratings. Biases might arise from the collection or use of such data, potentially leaking sensitive information or biasing the results of our project. For example, app review data might exclude certain populations depending on the content of each app, and the potential customer base differs widely from app to app, which can lead to incomplete results and unfair or discriminatory analysis. To prevent such problems, it is important to review the dataset thoroughly before analysis, identify any biases, and take action to mitigate them. During our analysis, we must remain vigilant for biases and evaluate the impact of any findings on different demographic and age groups. Afterward, when discussing the analysis, we should be transparent about the dataset's limitations as part of a responsible and ethical approach.

Furthermore, the misuse of user data for targeted advertising without explicit consent could infringe upon user privacy. To address this, it is necessary to anonymize and aggregate the data, use a dataset from a public source that has user consent for data usage, and ensure compliance with any privacy regulations.

In summary, ethics and privacy are essential to data scientists. Addressing concerns, mitigating biases, protecting privacy, and complying with ethical standards are important during data collection, analysis, and communication. Transparency, fairness, and responsible data handling help ensure an unbiased and ethical project.

# Data Wrangling

1. Download the datasets:

   We have added our dataset to our group GitHub repository: https://github.com/drsimpkins-teaching/cogs108_ss1_23_group_13/blob/main/app.csv
   
   The first step is to download the CSV file and store it in the current working directory. 
<br>

2. Import necessary libraries: 
   
   Next, import the required libraries. 
   
   **numpy** and **pandas** are imported under the aliases `np` and `pd`, respectively; the plotting and statistics libraries used later are imported in the same cell.  
<br>

3. Read the data from the CSV file: 
   
   Use the `pd.read_csv()` function from pandas to read the data from the CSV file (app.csv) into a DataFrame. 
   
   **NOTE**: The CSV file (app.csv) should be located in the current working directory.
<br>

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import plotly.express as px

app = pd.read_csv("app.csv")
app.head()

| Unnamed: 0 | App_Id | App_Name | AppStore_Url | Primary_Genre | Content_Rating | Size_Bytes | Required_IOS_Version | Released | Updated | Version | ... | Currency | Free | DeveloperId | Developer | Developer_Url | Developer_Website | Average_User_Rating | Reviews | Current_Version_Score | Current_Version_Reviews |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | com.hkbu.arc.apaper | A+ Paper Guide | https://apps.apple.com/us/app/a-paper-guide/id... | Education | 4+ | 21993472.0 | 8.0 | 2017-09-28T03:02:41Z | 2018-12-21T21:30:36Z | 1.1.2 | ... | USD | True | 1375410542 | HKBU ARC | https://apps.apple.com/us/developer/hkbu-arc/i... |  | 0.0 | 0 | 0.0 | 0 |
| 1 | com.dmitriev.abooks | A-Books | https://apps.apple.com/us/app/a-books/id103157... | Book | 4+ | 13135872.0 | 10.0 | 2015-08-31T19:31:32Z | 2019-07-23T20:31:09Z | 1.3 | ... | USD | True | 1031572001 | Roman Dmitriev | https://apps.apple.com/us/developer/roman-dmit... |  | 5.0 | 1 | 5.0 | 1 |
| 2 | no.terp.abooks | A-books | https://apps.apple.com/us/app/a-books/id145702... | Book | 4+ | 21943296.0 | 9.0 | 2021-04-14T07:00:00Z | 2021-05-30T21:08:54Z | 1.3.1 | ... | USD | True | 1457024163 | Terp AS | https://apps.apple.com/us/developer/terp-as/id... |  | 0.0 | 0 | 0.0 | 0 |
| 3 | fr.antoinettefleur.Book1 | A-F Book #1 | https://apps.apple.com/us/app/a-f-book-1/id500... | Book | 4+ | 81851392.0 | 8.0 | 2012-02-10T03:40:07Z | 2019-10-29T12:40:37Z | 1.2 | ... | USD | False | 439568839 | i-editeur.com | https://apps.apple.com/us/developer/i-editeur-... |  | 0.0 | 0 | 0.0 | 0 |
| 4 | com.imonstersoft.azdictionaryios | A-Z Synonyms Dictionary | https://apps.apple.com/us/app/a-z-synonyms-dic... | Reference | 4+ | 64692224.0 | 9.0 | 2020-12-16T08:00:00Z | 2020-12-18T21:36:11Z | 1.0.1 | ... | USD | True | 656731821 | Ngov chiheang | https://apps.apple.com/us/developer/ngov-chihe... | http://imonstersoft.com | 0.0 | 0 | 0.0 | 0 |
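Because the load step depends on `app.csv` being in the current working directory, a slightly more defensive variant may be useful. The sketch below is an assumption about how that check could look, not part of the original notebook; `load_app_data` is a hypothetical helper name.

```python
from pathlib import Path

import pandas as pd


def load_app_data(csv_path: str = "app.csv") -> pd.DataFrame:
    """Read the app CSV, failing early with a clear message if it is missing."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path.resolve()} not found; download app.csv into the "
            "current working directory first."
        )
    return pd.read_csv(path)
```

With the file in place, `app = load_app_data()` would behave like the `pd.read_csv("app.csv")` call above.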


# Data Cleaning

#### 1. Removing Rows with No Reviews:

Any row in the DataFrame where the 'Reviews' column has a value less than or equal to zero is removed. This effectively removes apps that have no reviews, which are useless to our analysis.

In [10]:
# Remove useless rows with no reviews
app = app[app['Reviews'] > 0]

#### 2. Dropping Useless Columns:

Columns that are considered irrelevant or redundant for the analysis are dropped from the DataFrame.

In [11]:
# Drop columns that are irrelevant or redundant for the analysis
# (each column is listed once; the original list repeated two names)
app = app.drop(columns=['AppStore_Url', 'Currency', 'Developer_Url', 'Developer_Website',
                        'Average_User_Rating', 'Current_Version_Reviews', 'Required_IOS_Version',
                        'DeveloperId', 'Developer'])

#### 3. Dropping Rows with Null Values:

Any row with at least one missing (NaN) value is dropped from the DataFrame to get rid of missing data.
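Before dropping rows, it can help to see which columns actually contribute missing values, since `dropna(how='any')` removes a row for a single missing cell. The sketch below shows one way to summarize this on a small made-up DataFrame; `missing_report` is a hypothetical helper and the toy values are illustrative only.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values per column, largest first."""
    counts = df.isna().sum()
    report = pd.DataFrame({"n_missing": counts, "share": counts / len(df)})
    # Keep only columns that have at least one missing value.
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Made-up rows for illustration; not taken from the real dataset.
toy = pd.DataFrame({
    "Price": [0.0, np.nan, 1.99, 0.99],
    "Size_Bytes": [13135872.0, 26133504.0, np.nan, np.nan],
    "Reviews": [1, 1285, 242, 21],
})
report = missing_report(toy)
print(report)
```

If one column accounts for most of the missing values, dropping that column instead of the affected rows may preserve more observations.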


In [12]:
# Drop rows with null value
app = app.dropna(how='any')
app.head()

| Unnamed: 0 | App_Id | App_Name | Primary_Genre | Content_Rating | Size_Bytes | Released | Updated | Version | Price | Free | Reviews | Current_Version_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | com.dmitriev.abooks | A-Books | Book | 4+ | 13135872.0 | 2015-08-31T19:31:32Z | 2019-07-23T20:31:09Z | 1.3 | 0.0 | True | 1 | 5.0 |
| 10 | com.pitashi.audiojoy.aacompanionfree | AA Audio Companion for Alcoholics Anonymous | Book | 17+ | 26133504.0 | 2017-04-19T13:24:42Z | 2017-08-24T00:29:56Z | 3.6.1 | 0.0 | True | 1285 | 4.78132 |
| 11 | com.goodbarber.bigbookfree | AA Big Book (Unofficial) | Book | 17+ | 63112192.0 | 2015-05-12T07:45:22Z | 2021-09-18T18:55:21Z | 2.2.16 | 0.0 | True | 1839 | 4.78902 |
| 12 | com.laltrello.aabigbookandmore | AA Big Book and More | Lifestyle | 4+ | 3095552.0 | 2012-04-02T11:01:26Z | 2017-04-11T03:25:00Z | 4.0 | 1.99 | False | 242 | 4.67354 |
| 13 | com.aabigbook.appstore | AA Big Book App - Unofficial | Book | 17+ | 2094080.0 | 2015-12-19T00:41:11Z | 2018-10-17T20:01:47Z | 1.4.2 | 0.99 | False | 21 | 3.09524 |
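The three cleaning steps above can also be wrapped in a single function that records how many rows survive each step, which makes the effect of each filter easier to report. This is a sketch on made-up data, not the notebook's actual code path; `clean_apps` and `DROP_COLS` are hypothetical names, and `errors="ignore"` is added so the sketch runs on frames that lack some of the listed columns.

```python
import numpy as np
import pandas as pd

# Columns the notebook drops as irrelevant or redundant.
DROP_COLS = [
    "AppStore_Url", "Currency", "Developer_Url", "Developer_Website",
    "Average_User_Rating", "Current_Version_Reviews", "Required_IOS_Version",
    "DeveloperId", "Developer",
]


def clean_apps(df: pd.DataFrame, drop_cols=DROP_COLS):
    """Apply the three cleaning steps and report the row count after each."""
    counts = {"raw": len(df)}
    df = df[df["Reviews"] > 0]                      # 1. keep apps with reviews
    counts["with_reviews"] = len(df)
    df = df.drop(columns=list(drop_cols), errors="ignore")  # 2. drop unused columns
    df = df.dropna(how="any")                       # 3. drop rows with any NaN
    counts["complete"] = len(df)
    return df, counts


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "App_Name": ["a", "b", "c", "d"],
    "Currency": ["USD"] * 4,
    "Price": [0.0, 0.0, np.nan, 1.99],
    "Reviews": [0, 1, 5, 2],
})
cleaned, counts = clean_apps(toy)
print(counts)
print(cleaned)
```

Reporting these counts alongside the results would make clear how much of the original sample the analysis actually covers.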


# Data Visualization

* This is a good place for some relevant visualizations related to any exploratory data analyses (EDA) you did after the basic cleaning.
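One exploratory view that fits the research question is the distribution of `Current_Version_Score` within each `Primary_Genre`. The following is a minimal sketch using matplotlib box plots on five rows copied from the cleaned preview; `plot_score_by_genre` is a hypothetical helper, and the `Agg` backend line is only there so the sketch runs without a display (it can be dropped inside a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_score_by_genre(df: pd.DataFrame):
    """Draw one box per genre showing the spread of current version scores."""
    genres = sorted(df["Primary_Genre"].unique())
    groups = [
        df.loc[df["Primary_Genre"] == g, "Current_Version_Score"].to_numpy()
        for g in genres
    ]
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.boxplot(groups)
    # Boxes are drawn at positions 1..n, so label those positions by genre.
    ax.set_xticks(list(range(1, len(genres) + 1)))
    ax.set_xticklabels(genres, rotation=45, ha="right")
    ax.set_xlabel("Primary_Genre")
    ax.set_ylabel("Current_Version_Score")
    ax.set_title("Current version score by genre")
    fig.tight_layout()
    return fig, ax


# Genre and score values copied from the cleaned preview rows.
toy = pd.DataFrame({
    "Primary_Genre": ["Book", "Book", "Book", "Lifestyle", "Book"],
    "Current_Version_Score": [5.0, 4.78132, 4.78902, 4.67354, 3.09524],
})
fig, ax = plot_score_by_genre(toy)
```

On the full dataset, genres with very few apps would produce unstable boxes, so filtering to genres above some minimum count may be worth considering.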

# Data Analysis & Results

* Include cells that describe the steps in your data analysis.
* You'll likely also have some visualizations here as well.
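As a starting point for testing each predictor individually, the sketch below runs a Kruskal-Wallis test across genres and Spearman rank correlations for the numeric predictors, using `scipy.stats`. These tests are an assumption chosen for illustration, not the method this project commits to. The data are randomly generated, `Update_Interval_Days` is a hypothetical derived column, and `screen_predictors` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data for illustration only; no real relationship is built in.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "Primary_Genre": rng.choice(["Book", "Education", "Lifestyle"], size=n),
    "Price": rng.choice([0.0, 0.99, 1.99, 4.99], size=n),
    "Update_Interval_Days": rng.integers(0, 2000, size=n),
    "Current_Version_Score": rng.uniform(1.0, 5.0, size=n),
})


def screen_predictors(df: pd.DataFrame, alpha: float = 0.05) -> dict:
    """Test each predictor against the score and flag whether to reject the null."""
    # Categorical predictor: compare score distributions across genres.
    groups = [
        g["Current_Version_Score"].to_numpy()
        for _, g in df.groupby("Primary_Genre")
    ]
    p_values = {"Primary_Genre": stats.kruskal(*groups).pvalue}
    # Numeric predictors: rank correlation with the score.
    for col in ("Price", "Update_Interval_Days"):
        p_values[col] = stats.spearmanr(df[col], df["Current_Version_Score"]).pvalue
    return {
        name: {"p_value": float(p), "reject_null": bool(p < alpha)}
        for name, p in p_values.items()
    }


results = screen_predictors(toy)
for name, res in results.items():
    print(name, res)
```

These tests only address each variable on its own; assessing the three variables together, or predicting the score, would need a joint model such as a regression with genre encoded as indicator variables.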

In [1]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

# Conclusion & Discussion

* Discussion of your results and how they address your experimental question(s).
* Come to a conclusion about your questions and hypothesis (remember we can only reject or fail to reject the null; we cannot accept the hypothesis).
* What are the implications of your results?
* Discuss limitations of your analyses.
* You can also discuss future directions this work could be taken.