# Step 1: Find a dataset
https://www.kaggle.com/datasets/yusufdelikkaya/google-play-store-apps-dataset?resource=download

**Description:**
- The dataset comprises anonymized data on apps available on the Google Play Store, capturing various aspects such as ratings, downloads, and categorization.
- The dataset has 10,841 entries, with some columns containing missing values, particularly in "Rating," "Type," "Content Rating," "Current Ver," and "Android Ver".
- This dataset can be utilized for analyzing trends in mobile app usage, user preferences, and app performance metrics across different categories.

Because the dataset has more than one numeric column, I pick the *Reviews* column for further analysis.

# Step 2: Work with the data in pandas

## 1. Read in the data
After importing the pandas library, I load the dataset into a pandas DataFrame using the read_csv function. The dataset contains information about various mobile apps, including the numeric 'Reviews' column, which holds the number of reviews each app has received and is the focus of this analysis.

In [8]:
import pandas as pd

In [9]:
data = pd.read_csv("google_play_store_dataset.csv")

## 2. Data Cleaning
In this step, I convert the 'Reviews' column to numeric values, turning any entry that cannot be parsed into NaN, and then remove the missing (NaN) values so that the calculations are not affected by incomplete entries.

In [13]:
data['Reviews'] = pd.to_numeric(data['Reviews'], errors='coerce')
reviews = data['Reviews'].dropna() 
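
To illustrate what the coercion step does, here is a minimal sketch on a small made-up Series. The sample values (including '3.0M') are assumptions for illustration, not rows quoted from the dataset, and the names `sample`, `numeric`, `n_dropped`, and `cleaned` are hypothetical.

```python
import pandas as pd

# Hypothetical sample values; not taken from the real dataset
sample = pd.Series(["159", "967", "3.0M", None, "0"])

# Strings that cannot be parsed as numbers become NaN instead of raising an error
numeric = pd.to_numeric(sample, errors="coerce")

# Count the entries that dropna() would remove, then remove them
n_dropped = int(numeric.isna().sum())
cleaned = numeric.dropna()

print("Dropped entries:", n_dropped)
print("Remaining values:", cleaned.tolist())
```

Counting the NaN values before calling dropna() is a quick way to see how many entries the cleaning step discards.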

## 3. Compute the mean, median, and mode

In [14]:
mean_reviews = reviews.mean()
median_reviews = reviews.median()
mode_reviews = reviews.mode()

# Print the results
print("Mean of Reviews:", mean_reviews)
print("Median of Reviews:", median_reviews)
print("Mode of Reviews:", mode_reviews[0] if not mode_reviews.empty else "No mode")

Mean of Reviews: 444152.89603321033
Median of Reviews: 2094.0
Mode of Reviews: 0.0


The results provide insights into user engagement across different Google Play Store apps:

**1. Mean of Reviews (444,153):** The mean of reviews is about 444,153, representing the average number of reviews across all apps. This relatively high mean suggests that some apps have a substantial number of reviews, likely popular apps with millions of users. However, this average may be skewed by a few high-review outliers, as indicated by the large gap between the mean and median.

**2. Median of Reviews (2,094):** The median, or the middle value, is significantly lower than the mean. This indicates that half of the apps have fewer than 2,094 reviews. It also suggests a skewed distribution, where the majority of apps have relatively low review counts, but a few have very high counts that pull the mean up.

**3. Mode of Reviews (0):** The mode, or the most frequently occurring value, is 0. This tells us that a considerable number of apps have no reviews at all, which could include newly launched or less popular apps that haven’t attracted user attention yet.
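
The gap between the mean and the median can be reproduced on a small made-up sample. The sketch below uses hypothetical review counts (not values from the dataset) to show how a single very popular app pulls the mean up while the median barely moves.

```python
import statistics

# Hypothetical review counts: most apps are small
typical = [0, 0, 12, 150, 2000, 3500]
# The same apps plus one hypothetical blockbuster
with_outlier = typical + [5_000_000]

mean_typical = statistics.mean(typical)
median_typical = statistics.median(typical)
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print("Without outlier: mean =", mean_typical, "median =", median_typical)
print("With outlier:    mean =", mean_outlier, "median =", median_outlier)
```

In this toy sample the mean jumps from under 1,000 to over 700,000 once the outlier is added, whereas the median only moves from 81 to 150.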

# Step 3: Repeat the previous step using only the Python standard library

In [15]:
import csv

## 1. Read in the CSV file and extract the "Reviews" column

I open the CSV file and read each row. I attempt to convert the 'Reviews' column to a float, appending valid numeric values to the reviews list. Non-numeric entries are skipped.

In [17]:
reviews = []
with open('google_play_store_dataset.csv', 'r', encoding='utf-8-sig') as file:
    reader = csv.DictReader(file)
    for row in reader:
        try:
            review_count = float(row['Reviews'])
            reviews.append(review_count)
        except ValueError:
            continue
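
The skipping behavior can be checked on a small in-memory CSV. This is a sketch under stated assumptions: the rows are made up, the names `sample_csv`, `reviews_sample`, and `skipped` are hypothetical, and catching TypeError in addition to ValueError is my addition, since csv.DictReader fills a missing trailing field with None and float(None) raises TypeError.

```python
import csv
import io

# Hypothetical in-memory CSV; the rows are made up for illustration
sample_csv = io.StringIO(
    "App,Reviews\n"
    "App A,159\n"
    "App B,3.0M\n"   # non-numeric text
    "App C,\n"       # empty field
    "App D,967\n"
    "App E\n"        # short row: Reviews is read as None
)

reviews_sample = []
skipped = 0
for row in csv.DictReader(sample_csv):
    try:
        reviews_sample.append(float(row["Reviews"]))
    except (ValueError, TypeError):
        # float("3.0M") and float("") raise ValueError; float(None) raises TypeError
        skipped += 1

print("Parsed values:", reviews_sample)
print("Skipped rows:", skipped)
```

Note that float also accepts strings such as "nan" and "inf", so a stricter check may be needed if those can appear in the column.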

## 2. Calculate mean, median, and mode
**(1) Mean:** I calculate the total sum of all values in reviews and then divide by the number of entries.

**(2) Median:** I sort reviews and find the middle element (or average of the two middle elements if the count is even).

**(3) Mode:** I use a dictionary to count occurrences of each unique review value. After counting, I find the maximum occurrence count and identify which values have that count. If there is only one mode, I return that value; if there are multiple modes, I return a list of them.

Specifically, review_counts is a dictionary in which each key is a unique value in reviews and each value is the number of times that value appears.

In [20]:
# Calculate the mean
total_reviews = sum(reviews)
mean_reviews = total_reviews / len(reviews)

# Calculate the median
sorted_reviews = sorted(reviews)
n = len(sorted_reviews)
if n % 2 == 1:
    median_reviews = sorted_reviews[n // 2]
else:
    median_reviews = (sorted_reviews[n // 2 - 1] + sorted_reviews[n // 2]) / 2

# Calculate the mode
review_counts = {}
for review in reviews:
    if review in review_counts:
        review_counts[review] += 1 # Increment the count if the review value already exists
    else:
        review_counts[review] = 1 # Initialize the count to 1 for a new review value

max_count = max(review_counts.values())
mode_reviews = [key for key, count in review_counts.items() if count == max_count]

print("Mean of Reviews:", mean_reviews)
print("Median of Reviews:", median_reviews)
print("Mode of Reviews:", mode_reviews[0] if len(mode_reviews) == 1 else mode_reviews)

Mean of Reviews: 444152.89603321033
Median of Reviews: 2094.0
Mode of Reviews: 0.0
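
As a cross-check on the manual calculations, the standard library's statistics module offers the same three measures. The sketch below runs on a hypothetical stand-in list rather than the real `reviews` data, and `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

# Hypothetical stand-in for the reviews list built above
reviews_sample = [0.0, 0.0, 5.0, 12.0, 2094.0, 87510.0]

mean_check = statistics.mean(reviews_sample)
median_check = statistics.median(reviews_sample)
# multimode returns a list of every value tied for the highest count
modes_check = statistics.multimode(reviews_sample)

print("Mean:", mean_check)
print("Median:", median_check)
print("Mode(s):", modes_check)
```

Like the dictionary-based approach, multimode returns every value tied for the highest count, so it handles multiple modes the same way.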


The standard-library results are identical to the pandas results, and the "Reviews" column again yields the following insights:

**1. Mean of Reviews (444,153):** The average number of reviews per app is quite high, suggesting that some apps are very popular with significant user engagement. However, this high mean likely indicates the influence of a few apps with exceptionally high review counts, which increases the overall average.

**2. Median of Reviews (2,094):** The median is significantly lower than the mean, showing that half of the apps have fewer than 2,094 reviews. This large difference between the mean and median suggests that the distribution is skewed, with a small number of apps gathering many more reviews than the majority.

**3. Mode of Reviews (0):** The mode, or most frequently occurring value, is zero, meaning that a considerable number of apps have no reviews at all. This implies that there are many lesser-known or newer apps on the platform that haven’t received user engagement.

# Step 4: Data visualization

In [30]:
import pandas as pd

# Step 1: Read the dataset using pandas
data = pd.read_csv('google_play_store_dataset.csv')

# Step 2: Clean the data (convert Rating to numeric, handle errors)
data['Rating'] = pd.to_numeric(data['Rating'], errors='coerce')

# Step 3: Drop rows with missing values in the Rating or Genres column
data = data.dropna(subset=['Rating', 'Genres'])

# Step 4: Group by Genre and calculate the mean rating for each genre
genre_ratings = data.groupby('Genres')['Rating'].mean().reset_index()

# Step 5: Sort genres by mean rating for a cleaner output
genre_ratings = genre_ratings.sort_values(by='Rating', ascending=False)

# Step 6: Create a function for generating a sparkline
def generate_sparkline(value, min_value, max_value, max_length=10):
    """Generates a sparkline based on the rating value, scaling it to fit the rating range."""
    scaled_value = int((value - min_value) / (max_value - min_value) * max_length)
    return '🌟' * scaled_value

# Step 7: Generate and print the results
min_rating = 1  # Minimum possible rating value
max_rating = 5  # Maximum possible rating value

print("Rating Statistics by Genre:")
for index, row in genre_ratings.iterrows():
    genre = row['Genres']
    mean_rating = row['Rating']
    sparkline = generate_sparkline(mean_rating, min_rating, max_rating)
    print(f"{genre}: {sparkline} | Mean Rating: {mean_rating:.2f}")


Rating Statistics by Genre:
February 11, 2018: 🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟 | Mean Rating: 19.00
Comics;Creativity: 🌟🌟🌟🌟🌟🌟🌟🌟🌟 | Mean Rating: 4.80
Board;Pretend Play: 🌟🌟🌟🌟🌟🌟🌟🌟🌟 | Mean Rating: 4.80
Health & Fitness;Education: 🌟🌟🌟🌟🌟🌟🌟🌟🌟 | Mean Rating: 4.70
Adventure;Brain Games: 🌟🌟🌟🌟🌟🌟🌟🌟🌟 | Mean Rating: 4.60
Strategy;Action & Adventure: 🌟🌟🌟🌟🌟🌟🌟🌟🌟 | Mean Rating: 4.60
Puzzle;Education: 🌟🌟🌟🌟🌟🌟🌟🌟🌟 | Mean Rating: 4.60
Music;Music & Video: 🌟🌟🌟🌟🌟🌟🌟🌟 | Mean Rating: 4.53
Entertainment;Creativity: 🌟🌟🌟🌟🌟🌟🌟🌟 | Mean Rating: 4.53
Strategy;Education: 🌟🌟🌟🌟🌟🌟🌟🌟 | Mean Rating: 4.50
Arcade;Pretend Play: 🌟🌟🌟🌟🌟🌟🌟🌟 | Mean Rating: 4.50
Tools;Education: 🌟🌟🌟🌟🌟🌟🌟🌟 | Mean Rating: 4.50
Racing;Pretend Play: 🌟🌟🌟🌟🌟🌟🌟🌟 | Mean Rating: 4.50
Casual;Brain Games: 🌟🌟🌟🌟🌟🌟🌟🌟 | Mean Rating: 4.47
Events: 🌟🌟🌟🌟🌟🌟🌟🌟 | Mean Rating: 4.44
Education;Brain Games: 🌟🌟🌟🌟🌟🌟🌟🌟 | Mean Rating: 4.43
Adventure;Action & Adventure: 🌟🌟🌟🌟🌟🌟🌟🌟 | Mean Rating: 4.42
Simulation;Action & Adventure: 🌟🌟🌟🌟🌟🌟🌟🌟 | Mean Rating: 4.42
Word: 🌟🌟🌟🌟🌟🌟
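
Because the scaling in generate_sparkline does not bound its input, an out-of-range mean such as the 19.00 in the first output row produces far more than ten stars. A minimal sketch of a clamped variant follows; the function name `generate_clamped_sparkline` and its `symbol` parameter are hypothetical and not part of the original code.

```python
def generate_clamped_sparkline(value, min_value=1.0, max_value=5.0,
                               max_length=10, symbol="🌟"):
    """Hypothetical variant of generate_sparkline that bounds the bar length."""
    if max_value <= min_value:
        raise ValueError("max_value must be greater than min_value")
    fraction = (value - min_value) / (max_value - min_value)
    # Clamp to [0, 1] so out-of-range values cannot overflow or underflow the bar
    fraction = max(0.0, min(1.0, fraction))
    return symbol * int(fraction * max_length)


print(generate_clamped_sparkline(4.8))   # within the 1-5 range
print(generate_clamped_sparkline(19.0))  # above the range, capped at max_length
```

Clamping only hides the symptom in the display; a rating above 5 still points to a bad record that is better filtered out before grouping.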

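The output shows that mean ratings vary by genre. The first row, February 11, 2018 with a mean rating of 19.00, is not a real genre: a rating above 5 and a date in the Genres column suggest a malformed record whose columns are shifted, so this row should be excluded from the interpretation.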

**1. Top-rated categories:** Combined genres such as Comics;Creativity, Board;Pretend Play, and Health & Fitness;Education received the highest mean ratings, between 4.70 and 4.80. These categories tend to offer interactive and engaging content, which may contribute to their higher ratings. For example, Comics and Creativity appeal to users' need for entertainment and self-expression, while Health & Fitness resonates with individuals looking to improve their well-being.

**2. Highly rated sub-genres:** Several combined genres involving Puzzle, Education, and Strategy, such as Puzzle;Education and Strategy;Education, maintain high mean ratings of about 4.50 to 4.60, suggesting that users value apps that are both challenging and educational. For instance, Puzzle games not only entertain but also encourage critical thinking, which may explain their popularity.

**3. Slightly lower-rated categories:** On the lower end, genres like Dating, Health & Fitness - Action & Adventure, and Parenting show ratings below 4.00. These categories may reflect niche markets with more specific user needs or expectations, which could contribute to less overall satisfaction. In the case of Dating apps, users may have higher expectations, given the personal nature of these apps, which can lead to more critical feedback.
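
When reading a ranking like this, it can help to check how many apps sit behind each mean, since a genre with only one or two rated apps can reach the top or bottom on very little evidence. The sketch below uses made-up rows; the data, the `min_apps` threshold, and the names `toy`, `summary`, and `reliable` are assumptions for illustration, not values from the dataset.

```python
import pandas as pd

# Hypothetical rows; not taken from the real dataset
toy = pd.DataFrame({
    "Genres": ["Comics;Creativity", "Tools", "Tools", "Tools", "Dating", "Dating"],
    "Rating": [4.8, 4.0, 4.2, 4.4, 3.9, 4.1],
})

# Mean rating and number of rated apps per genre
summary = (
    toy.groupby("Genres")["Rating"]
    .agg(mean_rating="mean", n_apps="count")
    .sort_values("mean_rating", ascending=False)
)

# Keep only genres backed by at least min_apps rated apps (threshold is an assumption)
min_apps = 2
reliable = summary[summary["n_apps"] >= min_apps]

print(summary)
print(reliable)
```

In this toy example the top-ranked genre rests on a single app, so it disappears once the threshold is applied.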