Exploratory analysis of the 'Google Play Store Apps' dataset, available on Kaggle.

(https://www.kaggle.com/datasets/lava18/google-play-store-apps)

The dataset contains data on roughly 10,000 Android apps from the Play Store (the main file has 10,841 rows).

Developed by:
- Dário Rodrigues (https://github.com/dariornj)
- Wender Enzo (https://github.com/wenderenzo123)

Part of the assessment for the course Técnicas de Programação I - Suzano and Ada

In [2]:
# Required packages

import pandas as pd
import numpy as np

In [3]:
# Loading the files

df_main = pd.read_csv('googleplaystore.csv')
df_review = pd.read_csv('googleplaystore_user_reviews.csv')

In [4]:
# Checking which columns are present and how many non-null cells each has

df_main.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [5]:
df_review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64295 entries, 0 to 64294
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     64295 non-null  object 
 1   Translated_Review       37427 non-null  object 
 2   Sentiment               37432 non-null  object 
 3   Sentiment_Polarity      37432 non-null  float64
 4   Sentiment_Subjectivity  37432 non-null  float64
dtypes: float64(2), object(3)
memory usage: 2.5+ MB


In [6]:
# Checking the contents of each table to see what kind of data it holds

df_main.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [7]:
df_review.head()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
2,10 Best Foods for You,,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3


The following can be concluded:
- Only the columns Rating (df_main) and Sentiment_Polarity and Sentiment_Subjectivity (df_review) are float.

- The number of reviews, the number of installs and the price (columns Reviews, Installs and Price) could also be numeric.

- Before converting the Size column to numeric, the trailing 'M' in its values must be removed; entries such as 'Varies with device' carry no number and need separate handling.

- The content rating (Content Rating), the date of the last update (Last Updated), the current version (Current Ver) and the Android version (Android Ver) can be disregarded in this analysis.

- In the review DataFrame, there are several rows for the same app and many rows with NaN.
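The Size conversion mentioned in the list above is not carried out in this notebook. A minimal sketch of how it could look is given here, assuming a hypothetical helper named `parse_size_mb`; the `'k'` (kilobytes) branch is an assumption, since no such value appears in the rows displayed here.

```python
from typing import Optional


def parse_size_mb(size: str) -> Optional[float]:
    """Convert a Size string such as '19M' to megabytes as a float.

    Returns None for values that carry no number (e.g. 'Varies with device').
    The 'k' (kilobytes) branch is an assumption: no such value appears in
    the rows displayed in this notebook.
    """
    text = size.strip()
    if text.endswith("M"):
        return float(text[:-1])
    if text.endswith("k"):
        return float(text[:-1]) / 1024
    return None


# Sample values taken from the Size column shown in df_main.head().
for raw in ["19M", "8.7M", "Varies with device"]:
    print(raw, "->", parse_size_mb(raw))
```

Under these assumptions, it could be applied with `df_main['Size'].apply(parse_size_mb)`, which would leave missing values wherever the size varies by device.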

In [8]:
# Removing the columns Content Rating, Last Updated, Current Ver and Android Ver from the main DataFrame

colunas_removidas = ['Content Rating', 'Last Updated','Current Ver', 'Android Ver']
df_main.drop(colunas_removidas, axis = 1, inplace = True)

In [9]:
# To avoid dropping entire rows that have NaN in Rating, replace the missing ratings with the mean value:
df_main['Rating'] = df_main['Rating'].fillna(df_main['Rating'].mean())
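Filling missing ratings with the mean keeps the column's mean unchanged but reduces its spread, which is worth keeping in mind when reading the summary statistics later on. The sketch below illustrates this on a toy Series; the values are made up for illustration and are not from the dataset.

```python
import pandas as pd

# Toy ratings with missing values (illustrative data, not from the dataset).
ratings = pd.Series([4.0, 5.0, None, 3.0, None])

# Same operation as in the cell above, applied to the toy data.
filled = ratings.fillna(ratings.mean())

print("mean before/after:", ratings.mean(), filled.mean())  # the mean is preserved
print("std before/after: ", ratings.std(), filled.std())    # the spread shrinks
```

A rating such as 4.193338 in the displayed df_main is most likely one of these imputed values rather than an actual user rating.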

In [10]:
# Removing the only row that contains NaN in Type:
df_main = df_main.dropna()

In [11]:
df_main.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10840 entries, 0 to 10840
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   App       10840 non-null  object 
 1   Category  10840 non-null  object 
 2   Rating    10840 non-null  float64
 3   Reviews   10840 non-null  object 
 4   Size      10840 non-null  object 
 5   Installs  10840 non-null  object 
 6   Type      10840 non-null  object 
 7   Price     10840 non-null  object 
 8   Genres    10840 non-null  object 
dtypes: float64(1), object(8)
memory usage: 846.9+ KB


In [12]:
# In the review DataFrame, some rows contain only the app name and no other information:
df_review.tail()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
64290,Houzz Interior Design Ideas,,,,
64291,Houzz Interior Design Ideas,,,,
64292,Houzz Interior Design Ideas,,,,
64293,Houzz Interior Design Ideas,,,,
64294,Houzz Interior Design Ideas,,,,


In [13]:
# Removing those rows (dropna() drops every row that has at least one NaN):
df_review = df_review.dropna()
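The `info()` output for df_review shows 37,432 non-null Sentiment values but only 37,427 non-null Translated_Review values, so a plain `dropna()` also discards the few rows that have a sentiment but no review text. The sketch below contrasts the two behaviours on toy data; the rows and the `info_cols` name are illustrative assumptions.

```python
import pandas as pd

# Toy review table (illustrative rows, not taken from the dataset).
reviews = pd.DataFrame(
    {
        "App": ["A", "A", "B", "B"],
        "Translated_Review": ["Great", None, None, "Bad"],
        "Sentiment": ["Positive", "Positive", None, "Negative"],
    }
)

info_cols = ["Translated_Review", "Sentiment"]

# Drops any row with at least one missing value (the default behaviour).
strict = reviews.dropna()

# Drops only rows where every informative column is missing.
lenient = reviews.dropna(subset=info_cols, how="all")

print(len(strict), len(lenient))
```

Which variant is preferable depends on whether the analysis needs the review text or only the sentiment scores.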

In [14]:
df_review.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37427 entries, 0 to 64230
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     37427 non-null  object 
 1   Translated_Review       37427 non-null  object 
 2   Sentiment               37427 non-null  object 
 3   Sentiment_Polarity      37427 non-null  float64
 4   Sentiment_Subjectivity  37427 non-null  float64
dtypes: float64(2), object(3)
memory usage: 1.7+ MB


In [15]:
df_main

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Genres
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.100000,159,19M,"10,000+",Free,0,Art & Design
1,Coloring book moana,ART_AND_DESIGN,3.900000,967,14M,"500,000+",Free,0,Art & Design;Pretend Play
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.700000,87510,8.7M,"5,000,000+",Free,0,Art & Design
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.500000,215644,25M,"50,000,000+",Free,0,Art & Design
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.300000,967,2.8M,"100,000+",Free,0,Art & Design;Creativity
...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.500000,38,53M,"5,000+",Free,0,Education
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.000000,4,3.6M,100+,Free,0,Education
10838,Parkinson Exercices FR,MEDICAL,4.193338,3,9.5M,"1,000+",Free,0,Medical
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.500000,114,Varies with device,"1,000+",Free,0,Books & Reference


In [16]:
# Checking the values in the Installs column:

df_main['Installs'].value_counts()

1,000,000+        1579
10,000,000+       1252
100,000+          1169
10,000+           1054
1,000+             907
5,000,000+         752
100+               719
500,000+           539
50,000+            479
5,000+             477
100,000,000+       409
10+                386
500+               330
50,000,000+        289
50+                205
5+                  82
500,000,000+        72
1+                  67
1,000,000,000+      58
0+                  14
Free                 1
Name: Installs, dtype: int64

In [17]:
# Removing the row that contains 'Free' in the Installs column.
# Note: reset_index() without drop=True keeps the old index as a new 'index' column.

mask = df_main['Installs'] == 'Free'
df_main = df_main.loc[~mask].reset_index()

In [18]:
# In the Installs column, removing the commas and '+' symbols and converting to int:

df_main['Installs'] = df_main['Installs'].apply(lambda x: x.replace('+',''))
df_main['Installs'] = df_main['Installs'].apply(lambda x: x.replace(',',''))
df_main['Installs'] = df_main['Installs'].apply(lambda x: int(x))
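The same cleaning can also be written with vectorized string methods, and `pd.to_numeric` with `errors="coerce"` offers a way to flag stray values such as 'Free' instead of filtering them by hand. This is an alternative sketch on a toy column, not the approach used in this notebook.

```python
import pandas as pd

# Toy column mimicking the Installs values listed above, including the stray 'Free'.
installs = pd.Series(["10,000+", "500,000+", "0+", "Free"])

# Strip the '+' suffix and the thousands separators as literal characters.
cleaned = (
    installs
    .str.replace("+", "", regex=False)
    .str.replace(",", "", regex=False)
)

# errors="coerce" turns anything non-numeric into NaN instead of raising.
numeric = pd.to_numeric(cleaned, errors="coerce")

print(numeric.tolist())
print("rows that failed to parse:", int(numeric.isna().sum()))
```

Because NaN forces a float column, the unparsed rows would need to be dropped before converting to an integer type.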


In [19]:
# In the Price column, removing '$' and converting to float:
df_main['Price'] = df_main['Price'].apply(lambda x: x.replace('$',''))
df_main['Price'] = df_main['Price'].apply(lambda x: float(x))

In [20]:
# Converting Reviews to integer:
df_main['Reviews'] = df_main['Reviews'].apply(lambda x: int(x))

In [21]:
# Removing duplicate apps (keeps the first row found for each app name):
df_main = df_main.drop_duplicates('App')
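`drop_duplicates('App')` keeps whichever row for an app appears first. If duplicated rows differ, for example in Reviews, one alternative is to keep the row with the highest review count. The sketch below shows both options on toy data; the values are illustrative only.

```python
import pandas as pd

# Toy data (illustrative): the same app listed twice with different review counts.
apps = pd.DataFrame(
    {
        "App": ["Alpha", "Beta", "Alpha"],
        "Reviews": [100, 50, 250],
    }
)

# Default behaviour: keep the first occurrence of each app name.
first_kept = apps.drop_duplicates("App")

# Alternative: sort so the most-reviewed row comes first, then deduplicate.
most_reviewed = (
    apps.sort_values("Reviews", ascending=False)
    .drop_duplicates("App")
    .sort_index()
)

print(first_kept)
print(most_reviewed)
```

The choice matters for rankings based on Reviews, since the two strategies can retain different counts for the same app.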


In [22]:
df_main.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9658 entries, 0 to 10838
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   index     9658 non-null   int64  
 1   App       9658 non-null   object 
 2   Category  9658 non-null   object 
 3   Rating    9658 non-null   float64
 4   Reviews   9658 non-null   int64  
 5   Size      9658 non-null   object 
 6   Installs  9658 non-null   int64  
 7   Type      9658 non-null   object 
 8   Price     9658 non-null   float64
 9   Genres    9658 non-null   object 
dtypes: float64(2), int64(3), object(5)
memory usage: 830.0+ KB


With the numeric values properly converted, we can get a general statistical overview of the data.

In [23]:
df_main.describe()

Unnamed: 0,index,Rating,Reviews,Installs,Price
count,9658.0,9658.0,9658.0,9658.0,9658.0
mean,5665.812384,4.176285,216615.0,7778312.0,1.099413
std,3102.321144,0.494391,1831413.0,53761000.0,16.853021
min,0.0,1.0,0.0,0.0,0.0
25%,3111.25,4.0,25.0,1000.0,0.0
50%,5813.5,4.2,967.0,100000.0,0.0
75%,8326.75,4.5,29408.0,1000000.0,0.0
max,10840.0,5.0,78158310.0,1000000000.0,400.0


In [24]:
df_review.describe()

Unnamed: 0,Sentiment_Polarity,Sentiment_Subjectivity
count,37427.0,37427.0
mean,0.182171,0.49277
std,0.351318,0.259904
min,-1.0,0.0
25%,0.0,0.357143
50%,0.15,0.514286
75%,0.4,0.65
max,1.0,1.0


Some questions can now be answered:

In [47]:
# 1. Which app categories have the greatest variety (most apps)?

print("1. Which app categories have the greatest variety?")
categoria = df_main.groupby(['Category']).size().sort_values(ascending=False)
df_categoria = pd.DataFrame(categoria)
df_categoria.columns = ['Number of Apps']
df_categoria.head(5)


1. Which app categories have the greatest variety?


Unnamed: 0_level_0,Number of Apps
Category,Unnamed: 1_level_1
FAMILY,1831
GAME,959
TOOLS,827
BUSINESS,420
MEDICAL,395
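The category counts in this section all follow the same pattern: filter the rows, count apps per Category, sort, and show the top five. A small helper can capture that pattern. The function `top_categories` below is hypothetical (it is not part of the original notebook) and uses `value_counts`, which counts and sorts in one step; it is demonstrated on toy data.

```python
import pandas as pd


def top_categories(df, mask=None, n=5):
    """Return the n categories with the most apps, optionally after filtering.

    `top_categories` is a hypothetical helper, not part of the original notebook.
    """
    subset = df if mask is None else df[mask]
    # value_counts() counts rows per category and sorts in descending order.
    counts = subset["Category"].value_counts().head(n)
    return counts.rename("Number of Apps").to_frame()


# Toy data (illustrative only).
toy = pd.DataFrame(
    {
        "Category": ["FAMILY", "GAME", "FAMILY", "TOOLS", "FAMILY", "GAME"],
        "Type": ["Free", "Free", "Paid", "Free", "Free", "Paid"],
    }
)

print(top_categories(toy))
print(top_categories(toy, toy["Type"] == "Paid"))
```

With the real data, a call such as `top_categories(df_main, df_main['Rating'] > 4.5)` would be expected to reproduce the corresponding table, although the order of tied categories may differ from the `groupby` version.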


In [46]:
# 2. Which app categories have the greatest variety among paid apps?

print("2. Which app categories have the greatest variety among paid apps?")
categoria = df_main[df_main['Type'] == 'Paid'].groupby('Category').size().sort_values(ascending=False)
df_categoria = pd.DataFrame(categoria)
df_categoria.columns = ['Number of Apps']
df_categoria.head(5)

2. Which app categories have the greatest variety among paid apps?


Unnamed: 0_level_0,Number of Apps
Category,Unnamed: 1_level_1
FAMILY,183
MEDICAL,83
GAME,82
PERSONALIZATION,81
TOOLS,78


In [45]:
# 3. Which app categories have the greatest variety among free apps?

print("3. Which app categories have the greatest variety among free apps?")
categoria = df_main[df_main['Type'] == 'Free'].groupby('Category').size().sort_values(ascending=False)
df_categoria = pd.DataFrame(categoria)
df_categoria.columns = ['Number of Apps']
df_categoria.head(5)

3. Which app categories have the greatest variety among free apps?


Unnamed: 0_level_0,Number of Apps
Category,Unnamed: 1_level_1
FAMILY,1648
GAME,877
TOOLS,749
BUSINESS,408
LIFESTYLE,350


In [44]:
# 4. Which app categories have the greatest variety among apps rated above 4.5?

print("4. Which app categories have the greatest variety among apps rated above 4.5?")
categoria = df_main[df_main['Rating'] > 4.5].groupby('Category').size().sort_values(ascending=False)
df_categoria = pd.DataFrame(categoria)
df_categoria.columns = ['Number of Apps']
df_categoria.head(5)



4. Which app categories have the greatest variety among apps rated above 4.5?


Unnamed: 0_level_0,Number of Apps
Category,Unnamed: 1_level_1
FAMILY,342
GAME,159
TOOLS,111
HEALTH_AND_FITNESS,89
MEDICAL,81


In [48]:
# 5. Which app categories have the greatest variety among apps rated below 4.0?

print("5. Which app categories have the greatest variety among apps rated below 4.0?")
categoria = df_main[df_main['Rating'] < 4.0].groupby('Category').size().sort_values(ascending=False)
df_categoria = pd.DataFrame(categoria)
df_categoria.columns = ['Number of Apps']
df_categoria.head(5)

5. Which app categories have the greatest variety among apps rated below 4.0?


Unnamed: 0_level_0,Number of Apps
Category,Unnamed: 1_level_1
FAMILY,399
TOOLS,229
GAME,149
LIFESTYLE,96
BUSINESS,77


In [52]:
# 6. Which app categories have the greatest variety among apps rated exactly 5.0?
print("6. Which app categories have the greatest variety among apps rated exactly 5.0?")
categoria = df_main[df_main['Rating'] == 5.0].groupby('Category').size().sort_values(ascending=False)
df_categoria = pd.DataFrame(categoria)
df_categoria.columns = ['Number of Apps']
df_categoria.head(5)


6. Which app categories have the greatest variety among apps rated exactly 5.0?


Unnamed: 0_level_0,Number of Apps
Category,Unnamed: 1_level_1
FAMILY,67
LIFESTYLE,29
MEDICAL,25
BUSINESS,18
TOOLS,17


In [53]:
# 7. Which are the 10 most popular apps (by number of reviews)?

print("7. Which are the 10 most popular apps (by number of reviews)?")
ranking = df_main.sort_values(by = 'Reviews', ascending=False).head(10)
colunas = ['App', 'Reviews']
novo_df = ranking[colunas]
novo_df.index = np.arange(1, 11)
novo_df


7. Which are the 10 most popular apps (by number of reviews)?


Unnamed: 0,App,Reviews
1,Facebook,78158306
2,WhatsApp Messenger,69119316
3,Instagram,66577313
4,Messenger – Text and Video Chat for Free,56642847
5,Clash of Clans,44891723
6,Clean Master- Space Cleaner & Antivirus,42916526
7,Subway Surfers,27722264
8,YouTube,25655305
9,"Security Master - Antivirus, VPN, AppLock, Boo...",24900999
10,Clash Royale,23133508
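The top-10 rankings sort the whole table and then take the head. `DataFrame.nlargest` is an equivalent, more direct way to express this, and numbering the rows from the actual result length avoids a length-mismatch error if a filter returns fewer than 10 apps. The sketch below uses toy data and is a suggestion rather than the notebook's own code.

```python
import pandas as pd

# Toy data (illustrative only).
toy = pd.DataFrame(
    {
        "App": ["A", "B", "C", "D"],
        "Reviews": [10, 400, 30, 200],
    }
)

# Take the 3 rows with the highest review counts.
top = toy.nlargest(3, "Reviews")[["App", "Reviews"]]

# Number the ranking from 1 using the actual number of rows returned.
top.index = range(1, len(top) + 1)

print(top)
```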


In [54]:
# 8. Which are the 10 most popular paid apps (by number of reviews)?
print("8. Which are the 10 most popular paid apps (by number of reviews)?")

ranking = df_main[df_main['Type'] == 'Paid'].sort_values(by = 'Reviews', ascending=False).head(10)
colunas = ['App', 'Reviews']
novo_df = ranking[colunas]
novo_df.index = np.arange(1, 11)
novo_df


8. Which are the 10 most popular paid apps (by number of reviews)?


Unnamed: 0,App,Reviews
1,Minecraft,2376564
2,Hitman Sniper,408292
3,Grand Theft Auto: San Andreas,348962
4,Bloons TD 5,190086
5,Where's My Water?,188740
6,Card Wars - Adventure Time,129603
7,True Skate,129409
8,Five Nights at Freddy's,100805
9,Beautiful Widgets Pro,97890
10,DraStic DS Emulator,87766


In [55]:
# 9. Which are the 10 most popular free apps (by number of reviews)?
print("9. Which are the 10 most popular free apps (by number of reviews)?")

ranking = df_main[df_main['Type'] == 'Free'].sort_values(by = 'Reviews', ascending=False).head(10)
colunas = ['App', 'Reviews']
novo_df = ranking[colunas]
novo_df.index = np.arange(1, 11)
novo_df

9. Which are the 10 most popular free apps (by number of reviews)?


Unnamed: 0,App,Reviews
1,Facebook,78158306
2,WhatsApp Messenger,69119316
3,Instagram,66577313
4,Messenger – Text and Video Chat for Free,56642847
5,Clash of Clans,44891723
6,Clean Master- Space Cleaner & Antivirus,42916526
7,Subway Surfers,27722264
8,YouTube,25655305
9,"Security Master - Antivirus, VPN, AppLock, Boo...",24900999
10,Clash Royale,23133508


In [57]:
# 10. Which app categories are the most popular among apps with more than 1 million installs?
# Note: Installs holds the lower bound of each bucket, so '> 1000000' excludes apps in the '1,000,000+' bucket.
print("10. Which app categories are the most popular among apps with more than 1 million installs?")

ranking = df_main[df_main['Installs'] > 1000000].sort_values(by = 'Reviews', ascending=False).head(10)
colunas = ['Category', 'App', 'Reviews', 'Installs']
novo_df = ranking[colunas]
novo_df.index = np.arange(1, 11)
novo_df

10. Which app categories are the most popular among apps with more than 1 million installs?


Unnamed: 0,Category,App,Reviews,Installs
1,SOCIAL,Facebook,78158306,1000000000
2,COMMUNICATION,WhatsApp Messenger,69119316,1000000000
3,SOCIAL,Instagram,66577313,1000000000
4,COMMUNICATION,Messenger – Text and Video Chat for Free,56642847,1000000000
5,GAME,Clash of Clans,44891723,100000000
6,TOOLS,Clean Master- Space Cleaner & Antivirus,42916526,500000000
7,GAME,Subway Surfers,27722264,1000000000
8,VIDEO_PLAYERS,YouTube,25655305,1000000000
9,TOOLS,"Security Master - Antivirus, VPN, AppLock, Boo...",24900999,500000000
10,GAME,Clash Royale,23133508,100000000
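The table above lists individual apps along with their categories. To rank the categories themselves, one option is to aggregate per category after applying the installs filter. The sketch below does this on toy data; the column names `apps` and `total_reviews` are hypothetical, and using total reviews as the popularity measure is an assumption.

```python
import pandas as pd

# Toy data (illustrative only).
toy = pd.DataFrame(
    {
        "Category": ["GAME", "SOCIAL", "GAME", "TOOLS", "SOCIAL"],
        "Installs": [5_000_000, 1_000_000_000, 10_000_000, 1_000_000, 500],
        "Reviews": [2_000, 90_000, 7_000, 300, 5],
    }
)

# Same filter as in the cell above: strictly more than 1 million installs.
popular = toy[toy["Installs"] > 1_000_000]

# Count apps and sum reviews per category, then rank by total reviews.
per_category = (
    popular.groupby("Category")
    .agg(apps=("Installs", "count"), total_reviews=("Reviews", "sum"))
    .sort_values("total_reviews", ascending=False)
)

print(per_category)
```

Ranking by `apps` instead of `total_reviews` would answer a slightly different question: which categories have the most high-install apps.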
