# **About the Dataset**

The file contains data on applications that are accessible to children. It is used to analyze how likely each application is to comply with the COPPA standard in the United States.

The `primaryGenreName` column takes the following values:
* Games
* Education
* Entertainment
* Business
* Lifestyle
* Tools
* Music & Audio
* Food & Drink
* Shopping
* Productivity
* Health & Fitness
* Utilities
* Books & Reference
* Finance
* Personalization
* Social
* Travel & Local
* Communication
* News & Magazines
* Medical
* Photography
* Travel
* Social Networking
* Maps & Navigation
* Sports
* Auto & Vehicles
* Art & Design
* News
* Video Players & Editors
* Music
* Photo & Video
* House & Home
* Weather
* Events
* Beauty
* Dating
* Reference
* Stickers
* Book
* Navigation
* Comics
* Graphics & Design
* Parenting
* Libraries & Demo
* Developer Tools
* Magazines & Newspapers

The dataset also contains information about each application, including:
1. Developer country
2. App reputation (user rating, number of downloads)
3. Supported device type
4. Terms and conditions compliance (service, privacy, safety)

In [86]:
import pandas as pd

# Load the datasets
target = pd.read_csv('target.csv')
train = pd.read_csv('train.csv')

In [87]:
# Make sure the indexes are aligned before concatenating column-wise
train = train.reset_index(drop=True)
target = target.reset_index(drop=True)
merged_data = pd.concat([train, target], axis=1)
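
Concatenating with `axis=1` pairs rows purely by position once the indexes are reset, so it silently produces NaN-padded rows if the two files differ in length. A minimal sketch of a guard, assuming a hypothetical helper `concat_aligned` and toy frames that only borrow column names from this dataset:

```python
import pandas as pd


def concat_aligned(features: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """Concatenate column-wise after confirming both frames have the same row count."""
    if len(features) != len(labels):
        raise ValueError(f"Row count mismatch: {len(features)} vs {len(labels)}")
    # Drop the old indexes so rows are matched by position only
    features = features.reset_index(drop=True)
    labels = labels.reset_index(drop=True)
    return pd.concat([features, labels], axis=1)


# Toy frames with deliberately different indexes (illustrative values only)
toy_train = pd.DataFrame({"primaryGenreName": ["Games", "Tools"]}, index=[10, 11])
toy_target = pd.DataFrame({"coppaRisk": [False, True]}, index=[0, 1])
toy_merged = concat_aligned(toy_train, toy_target)
```

Whether the rows of `train.csv` and `target.csv` really correspond by position is an assumption the notebook makes; the check above only catches a length mismatch, not a reordering.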

In [88]:
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7000 entries, 0 to 6999
Data columns (total 17 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   developerCountry                 7000 non-null   object 
 1   countryCode                      6936 non-null   object 
 2   userRatingCount                  7000 non-null   int64  
 3   primaryGenreName                 7000 non-null   object 
 4   downloads                        4851 non-null   object 
 5   deviceType                       7000 non-null   object 
 6   hasPrivacyLink                   6250 non-null   object 
 7   hasTermsOfServiceLink            2365 non-null   object 
 8   hasTermsOfServiceLinkRating      2365 non-null   object 
 9   isCorporateEmailScore            5872 non-null   float64
 10  adSpent                          1321 non-null   float64
 11  appAge                           6950 non-null   float64
 12  averageUserRating   

In [89]:
merged_data.head(100)

Unnamed: 0,developerCountry,countryCode,userRatingCount,primaryGenreName,downloads,deviceType,hasPrivacyLink,hasTermsOfServiceLink,hasTermsOfServiceLinkRating,isCorporateEmailScore,adSpent,appAge,averageUserRating,appContentBrandSafetyRating,appDescriptionBrandSafetyRating,mfaRating,coppaRisk
0,NORWAY,RO,127731,Sports,,smartphone,True,True,low,99.0,14.017220,160.400000,4.0,medium,low,low,False
1,ADDRESS NOT LISTED IN PLAYSTORE,GLOBAL,0,Medical,50 - 100,GLOBAL,True,,,99.0,,17.500000,0.0,,low,low,False
2,UNITED ARAB EMIRATES,CZ,51143,Games,50000000 - 100000000,GLOBAL,True,True,low,0.0,31.883163,30.766667,4.0,,low,low,False
3,GERMANY,GLOBAL,1074,Games,,GLOBAL,True,,,99.0,,71.533333,4.0,,low,low,False
4,CANNOT IDENTIFY COUNTRY,GLOBAL,17,Tools,1000 - 5000,GLOBAL,True,,,99.0,,52.400000,4.0,,low,low,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,ADDRESS NOT LISTED IN PLAYSTORE,GLOBAL,1,Reference,,GLOBAL,True,,,,,69.666667,1.0,,low,low,False
96,UNITED KINGDOM,GLOBAL,0,Business,,GLOBAL,True,,,99.0,,27.900000,0.0,,low,low,False
97,CANNOT IDENTIFY COUNTRY,GLOBAL,0,Games,50 - 100,GLOBAL,True,,,99.0,,58.133333,,,low,low,False
98,CANNOT IDENTIFY COUNTRY,GLOBAL,0,Entertainment,1 - 5,GLOBAL,True,,,0.0,,66.366667,,,medium,low,False


In [90]:
# Normalize unidentified developer countries to 'UNKNOWN' and fill missing download ranges with '0 - 0'
merged_data['developerCountry'] = merged_data['developerCountry'].replace(['ADDRESS NOT LISTED IN PLAYSTORE', 'CANNOT IDENTIFY COUNTRY'], 'UNKNOWN')
merged_data['downloads'] = merged_data['downloads'].fillna('0 - 0')
merged_data

Unnamed: 0,developerCountry,countryCode,userRatingCount,primaryGenreName,downloads,deviceType,hasPrivacyLink,hasTermsOfServiceLink,hasTermsOfServiceLinkRating,isCorporateEmailScore,adSpent,appAge,averageUserRating,appContentBrandSafetyRating,appDescriptionBrandSafetyRating,mfaRating,coppaRisk
0,NORWAY,RO,127731,Sports,0 - 0,smartphone,True,True,low,99.0,14.017220,160.400000,4.0,medium,low,low,False
1,UNKNOWN,GLOBAL,0,Medical,50 - 100,GLOBAL,True,,,99.0,,17.500000,0.0,,low,low,False
2,UNITED ARAB EMIRATES,CZ,51143,Games,50000000 - 100000000,GLOBAL,True,True,low,0.0,31.883163,30.766667,4.0,,low,low,False
3,GERMANY,GLOBAL,1074,Games,0 - 0,GLOBAL,True,,,99.0,,71.533333,4.0,,low,low,False
4,UNKNOWN,GLOBAL,17,Tools,1000 - 5000,GLOBAL,True,,,99.0,,52.400000,4.0,,low,low,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6995,UNKNOWN,GLOBAL,0,Utilities,0 - 0,GLOBAL,True,,,99.0,,26.266667,0.0,,low,low,False
6996,UNKNOWN,GLOBAL,0,Business,0 - 0,GLOBAL,True,,,,,23.800000,0.0,,low,low,False
6997,UNKNOWN,GLOBAL,0,Personalization,10 - 50,GLOBAL,True,,,0.0,,27.500000,,,medium,low,False
6998,UNKNOWN,GLOBAL,0,Business,10 - 50,GLOBAL,True,False,high,99.0,,124.033333,0.0,,low,low,False
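
Before filling gaps as in the cell above, it can help to quantify how much is missing per column. A minimal sketch, assuming a hypothetical helper `missing_summary` and a toy frame that only borrows column names from this dataset:

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and shares per column, most incomplete first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({"missing": counts, "share": counts / len(df)})
    return summary.sort_values("missing", ascending=False)


# Toy data (illustrative values only, not taken from the real files)
toy = pd.DataFrame({
    "downloads": ["50 - 100", np.nan, np.nan, "1 - 5"],
    "adSpent": [np.nan, np.nan, np.nan, 31.9],
    "userRatingCount": [0, 17, 1074, 0],
})
summary = missing_summary(toy)
print(summary)
```

On the real data this would be called as `missing_summary(merged_data)`, which gives the same information as the non-null counts in `merged_data.info()` but sorted and expressed as shares.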


In [91]:
# Convert each download range (e.g. '50 - 100') to its integer midpoint
def convert_downloads_to_int(download_range):
    try:
        min_str, max_str = download_range.split(' - ')
        min_val = int(min_str.replace(",", ""))
        max_val = int(max_str.replace(",", ""))
        return (min_val + max_val) // 2  # take the midpoint of the range
    except (AttributeError, ValueError):
        return 0  # if parsing fails, fill with 0

merged_data['downloads'] = merged_data['downloads'].apply(convert_downloads_to_int)
merged_data


Unnamed: 0,developerCountry,countryCode,userRatingCount,primaryGenreName,downloads,deviceType,hasPrivacyLink,hasTermsOfServiceLink,hasTermsOfServiceLinkRating,isCorporateEmailScore,adSpent,appAge,averageUserRating,appContentBrandSafetyRating,appDescriptionBrandSafetyRating,mfaRating,coppaRisk
0,NORWAY,RO,127731,Sports,0,smartphone,True,True,low,99.0,14.017220,160.400000,4.0,medium,low,low,False
1,UNKNOWN,GLOBAL,0,Medical,75,GLOBAL,True,,,99.0,,17.500000,0.0,,low,low,False
2,UNITED ARAB EMIRATES,CZ,51143,Games,75000000,GLOBAL,True,True,low,0.0,31.883163,30.766667,4.0,,low,low,False
3,GERMANY,GLOBAL,1074,Games,0,GLOBAL,True,,,99.0,,71.533333,4.0,,low,low,False
4,UNKNOWN,GLOBAL,17,Tools,3000,GLOBAL,True,,,99.0,,52.400000,4.0,,low,low,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6995,UNKNOWN,GLOBAL,0,Utilities,0,GLOBAL,True,,,99.0,,26.266667,0.0,,low,low,False
6996,UNKNOWN,GLOBAL,0,Business,0,GLOBAL,True,,,,,23.800000,0.0,,low,low,False
6997,UNKNOWN,GLOBAL,0,Personalization,30,GLOBAL,True,,,0.0,,27.500000,,,medium,low,False
6998,UNKNOWN,GLOBAL,0,Business,30,GLOBAL,True,False,high,99.0,,124.033333,0.0,,low,low,False
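
The conversion above maps both missing and unparseable ranges to 0, which makes them indistinguishable from apps with genuinely few downloads. One possible alternative, sketched here with a hypothetical helper `range_midpoint` and pandas' nullable `Int64` dtype, keeps those cases as missing values instead:

```python
import pandas as pd


def range_midpoint(download_range):
    """Return the integer midpoint of a 'low - high' string, or pd.NA if it cannot be parsed."""
    if not isinstance(download_range, str):
        return pd.NA
    parts = download_range.split(" - ")
    if len(parts) != 2:
        return pd.NA
    try:
        low, high = (int(p.replace(",", "")) for p in parts)
    except ValueError:
        return pd.NA
    return (low + high) // 2


# Toy values (illustrative only): two valid ranges, one missing, one malformed
toy_downloads = pd.Series(["50 - 100", "1,000 - 5,000", None, "unknown"])
midpoints = toy_downloads.apply(range_midpoint).astype("Int64")
print(midpoints)
```

Whether missing download counts should be treated as zero or left missing depends on the downstream model; this sketch only shows how to preserve the distinction.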
