<p align="center"><font size="6"><b>Notebook by Charles-Henri SAINT-MARS</b></font></p>

# **Pandas**

In [1]:
import pandas as pd
import numpy as np

## 1. Series

A Series is a one-dimensional labelled array whose values share a single data type. Each element is associated with an index label. <br>
A Series can be created from a list, a NumPy array, or a dictionary. <br>
An empty Series can also be created and filled afterwards. <br>
A Series can be given a name, and its elements can be given labels. <br>
Elements can be accessed by position or by label. <br>
Elements can also be modified. <br>
Arithmetic operations apply element-wise to the elements of a Series.
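
The points above can be sketched in a few lines. The values, labels, and variable names below are made up for illustration and are not taken from the notebook's data.

```python
import numpy as np
import pandas as pd

# Three ways to build a Series (illustrative values)
from_list = pd.Series([11, 15, 18])
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))
from_dict = pd.Series({"Fred": 11, "Joan": 15, "Ralph": 18}, name="scores")

# Start from an empty Series and add elements by label
empty = pd.Series(dtype="int64")
empty.loc["a"] = 1
empty.loc["b"] = 2

# Access, modify, and compute on elements
first_score = from_dict["Fred"]   # access by label
from_dict["Fred"] = 12            # modify an element in place
doubled = from_dict * 2           # element-wise arithmetic
```

When a Series is built from a dictionary, the keys become the index labels, which is why no `index=` argument is needed in that case.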


In [2]:
serie = pd.Series([11, 15, 18, 22, 24, 29, 30, 34, 38], name="Ma série de nombres")
serie

0    11
1    15
2    18
3    22
4    24
5    29
6    30
7    34
8    38
Name: Ma série de nombres, dtype: int64

In [3]:
serie = pd.Series([11, 15, 18, 22, 24, 29, 30, 34], index=["Fred", "Joan", "Ralph", "Penny", "Alex", "Beth", "John", "Sue"])
serie

Fred     11
Joan     15
Ralph    18
Penny    22
Alex     24
Beth     29
John     30
Sue      34
dtype: int64

In [4]:
# Summary statistics of the series
serie.describe()

count     8.000000
mean     22.875000
std       7.936129
min      11.000000
25%      17.250000
50%      23.000000
75%      29.250000
max      34.000000
dtype: float64

In [5]:
# std: standard deviation (a measure of how far the values spread around the mean)
# 25%: first quartile (25% of the values are less than or equal to this value)
# 50%: median (50% of the values are less than or equal to this value)
# 75%: third quartile (75% of the values are less than or equal to this value)
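
To make these definitions concrete, the figures reported by `describe()` can be recomputed one by one. This is a small sketch on the same eight values; the variable names are illustrative.

```python
import pandas as pd

# Same values as the labelled series above
values = pd.Series([11, 15, 18, 22, 24, 29, 30, 34])

mean = values.mean()
# describe() reports the sample standard deviation (divisor n - 1, i.e. ddof=1)
std = values.std(ddof=1)
# Quantiles use linear interpolation between neighbouring values by default
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])

print(round(mean, 3), round(std, 6), q1, median, q3)
```

The quartiles need not be actual values of the series: 17.25, for example, lies between 15 and 18 because of the interpolation.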

In [6]:
serie["Alex"]

24

In [7]:
serie.iloc[4] # Alex (position 4, counting from 0)

24
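
Label-based and position-based access can be made explicit with `.loc` and `.iloc`. To the best of my knowledge, recent pandas versions deprecate integer keys in plain square brackets on a label-indexed Series, so the explicit form is the safer habit. A minimal sketch on the same data:

```python
import pandas as pd

serie = pd.Series(
    [11, 15, 18, 22, 24, 29, 30, 34],
    index=["Fred", "Joan", "Ralph", "Penny", "Alex", "Beth", "John", "Sue"],
)

by_label = serie.loc["Alex"]    # label-based access
by_position = serie.iloc[4]     # position-based access (positions start at 0)
last_value = serie.iloc[-1]     # negative positions count from the end
```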

Selecting elements of a series

In [8]:
serie[["Alex", "Beth", "John", "Sue"]] # select several values by label

Alex    24
Beth    29
John    30
Sue     34
dtype: int64

In [9]:
serie > 20 # returns a Series of booleans

Fred     False
Joan     False
Ralph    False
Penny     True
Alex      True
Beth      True
John      True
Sue       True
dtype: bool

In [10]:
serie[serie > 20] # returns the values of the series that are greater than 20

Penny    22
Alex     24
Beth     29
John     30
Sue      34
dtype: int64
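
Boolean masks can also be combined. The sketch below reuses the same series; the thresholds are arbitrary and only serve as an illustration.

```python
import pandas as pd

serie = pd.Series(
    [11, 15, 18, 22, 24, 29, 30, 34],
    index=["Fred", "Joan", "Ralph", "Penny", "Alex", "Beth", "John", "Sue"],
)

# Parentheses are required: & and | bind more tightly than comparisons
between = serie[(serie > 20) & (serie < 30)]      # logical AND
outside = serie[(serie <= 20) | (serie >= 30)]    # logical OR
not_above_20 = serie[~(serie > 20)]               # logical NOT
```

The Python keywords `and`, `or`, and `not` do not work element-wise on a Series, hence the operators `&`, `|`, and `~`.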

In [11]:
serie.min()

11

In [12]:
serie.max()

34

## 2. DataFrames

2.1 A basic DataFrame

In [14]:
# A basic DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
print(df)

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9


In [15]:
# Sum each column
df.sum(axis=0)

A     6
B    15
C    24
dtype: int64

In [16]:
# Sum each row
df.sum(axis=1)

0    12
1    15
2    18
dtype: int64

2.2 Creating a DataFrame from a CSV file

In [17]:
# Create a DataFrame from a CSV file
col_names = ['Prix', 'Taille de la maison', 'Surface du terrain']
pd_houses = pd.read_csv('houses.csv', names=col_names)  # load the CSV file, supplying the column names
print(pd_houses)

      Prix  Taille de la maison  Surface du terrain
0   212000                 4148               25264
1   230000                 2501               11891
2   339000                 4374               25351
3   289000                 2398               22215
4   160000                 2536                9234
5    85000                 2368               13329
6    85000                 1264                8407
7   145000                 1572               12588
8   164000                 2375               16204
9   123500                 1161                9626
10  180000                 1542                8755
11  159500                 1464               14636
12  156000                 2240               21780
13  146500                 1269               11250
14  101500                  924                7361
15  109800                  768               10497
16  182000                 1320               15768
17  110000                 1845               12153
18  125000                 1274               13634
19   80000                 1905               10890
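
The role of `names=` can be checked without the file itself. The sketch below assumes, as the call above suggests, that `houses.csv` has no header row; the in-memory text `csv_text` is a stand-in holding the first three rows of the table.

```python
import io

import pandas as pd

# Stand-in for a header-less CSV file (first three rows of the table above)
csv_text = "212000,4148,25264\n230000,2501,11891\n339000,4374,25351\n"

col_names = ["Prix", "Taille de la maison", "Surface du terrain"]

# With names=, every line of the file is read as data
houses = pd.read_csv(io.StringIO(csv_text), names=col_names)

# Without names=, the first data row is consumed as the header
wrong = pd.read_csv(io.StringIO(csv_text))

print(houses.shape, wrong.shape)
```

Comparing the two shapes is a quick way to notice that a row has been swallowed as column labels.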

2.3 Reminder: creating an array from a CSV file

In [18]:
# Load the CSV file into a NumPy array
np_houses = np.genfromtxt('houses.csv', delimiter=',', dtype=int)
np_houses

array([[212000,   4148,  25264],
       [230000,   2501,  11891],
       [339000,   4374,  25351],
       [289000,   2398,  22215],
       [160000,   2536,   9234],
       [ 85000,   2368,  13329],
       [ 85000,   1264,   8407],
       [145000,   1572,  12588],
       [164000,   2375,  16204],
       [123500,   1161,   9626],
       [180000,   1542,   8755],
       [159500,   1464,  14636],
       [156000,   2240,  21780],
       [146500,   1269,  11250],
       [101500,    924,   7361],
       [109800,    768,  10497],
       [182000,   1320,  15768],
       [110000,   1845,  12153],
       [125000,   1274,  13634],
       [ 80000,   1905,  10890]])

2.4 Creating a DataFrame from an array

In [19]:
dataframe = pd.DataFrame(np_houses, columns=col_names)
dataframe

     Prix  Taille de la maison  Surface du terrain
0  212000                 4148               25264
1  230000                 2501               11891
2  339000                 4374               25351
3  289000                 2398               22215
4  160000                 2536                9234
5   85000                 2368               13329
6   85000                 1264                8407
7  145000                 1572               12588
8  164000                 2375               16204
9  123500                 1161                9626


## 3. DataFrames and Series

In [20]:
# Create a first series
serie1 = pd.Series([11, 15, 12, 13, 14, 12, 11, 11, 10, 14, 16, 16, 14], index=["Fred", "Joan", "Ralph", "Penny", "Alex", "Beth", "John", "Sue", "Jean", "Carl", "Alice", "Bob", "Eve"])
serie1

Fred     11
Joan     15
Ralph    12
Penny    13
Alex     14
Beth     12
John     11
Sue      11
Jean     10
Carl     14
Alice    16
Bob      16
Eve      14
dtype: int64

In [21]:
# Length of series 1
len(serie1)

13

In [22]:
# Create a second series
# Labels shared by both series: Fred, Joan, Alex, John, Sue, Jean, Carl, Alice, Bob
# Labels found only in the first series: Ralph, Penny, Beth, Eve
# Labels found only in the second series: Damien, Grace, Chloé, Samuel
serie2 = pd.Series([11, 15, 15, 11, 14, 12, 11, 13, 10, 14, 16, 16, 9], index=["Fred", "Joan", "Damien", "Grace", "Chloé", "Alex", "John", "Sue", "Jean", "Carl", "Alice", "Bob", "Samuel"])
serie2

Fred      11
Joan      15
Damien    15
Grace     11
Chloé     14
Alex      12
John      11
Sue       13
Jean      10
Carl      14
Alice     16
Bob       16
Samuel     9
dtype: int64

In [23]:
# Length of series 2
serie2.size

13

In [24]:
# Create a DataFrame from two series (rows are aligned on the union of the labels; missing values become NaN)
dataframe = pd.DataFrame({"Série 1": serie1, "Série 2": serie2})
dataframe

        Série 1  Série 2
Alex       14.0     12.0
Alice      16.0     16.0
Beth       12.0      NaN
Bob        16.0     16.0
Carl       14.0     14.0
Chloé       NaN     14.0
Damien      NaN     15.0
Eve        14.0      NaN
Fred       11.0     11.0
Grace       NaN     11.0


In [25]:
# Number of rows in the DataFrame
len(dataframe)

17

In [26]:
# Number of elements in the DataFrame (rows × columns)
dataframe.size

34
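
The row count of 17 follows from index alignment: 9 shared labels plus 4 labels specific to each series. The same mechanism can be seen on a smaller scale in this sketch, whose series `a` and `b` are made up for the purpose.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "w"])

# Rows are aligned on the sorted union of the labels: w, x, y, z
df = pd.DataFrame({"a": a, "b": b})

n_rows = len(df)              # number of rows
n_cells = df.size             # rows × columns, missing values included
non_missing = df.count()      # non-missing values per column

print(df)
```

Because NaN is a floating-point value, integer columns are converted to `float64` as soon as alignment introduces a missing value, which explains the `14.0`-style display above.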

In [27]:
# Maximum value of the "Série 2" column
dataframe["Série 2"].max()

16.0

In [28]:
# Minimum value of the "Série 2" column
dataframe["Série 2"].min()

9.0

## 4. Creating a DataFrame from a CSV file of the 'Google Play Store Apps' dataset from Kaggle
This file describes the Android app market between 2010 and 2018 and contains the following columns (variables): App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Genres, Last Updated, Current Ver, Android Ver.

Link: [https://www.kaggle.com/datasets/lava18/google-play-store-apps](https://www.kaggle.com/datasets/lava18/google-play-store-apps)

<font size="6" color="red">>>> Importing, reading, and writing a file >>></font>

In [29]:
# Load the CSV file
google_play_store = pd.read_csv('googleplaystore.csv')
type(google_play_store)

pandas.core.frame.DataFrame

In [30]:
# Display the DataFrame (for a long table, pandas shows the first 5 and the last 5 rows)
google_play_store

,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


In [31]:
# Data type of each column
google_play_store.dtypes

App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [32]:
# Convert the "Last Updated" column to datetime (unparseable values become NaT)
last_updated_dates = pd.to_datetime(google_play_store["Last Updated"], errors='coerce')
#last_updated_dates

# Earliest date in the "Last Updated" column
print("Earliest date in the 'Last Updated' column: ", last_updated_dates.min())

# Latest date in the "Last Updated" column
print("Latest date in the 'Last Updated' column: ", last_updated_dates.max())

Earliest date in the 'Last Updated' column:  2010-05-21 00:00:00
Latest date in the 'Last Updated' column:  2018-08-08 00:00:00
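
The effect of `errors='coerce'` is easier to see on a tiny example. In the sketch below, the `raw` series and the explicit `format` string are assumptions made for illustration; the notebook itself lets pandas infer the format.

```python
import pandas as pd

# Two dates written like those in "Last Updated", plus one invalid entry
raw = pd.Series(["January 7, 2018", "August 1, 2018", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising an error
dates = pd.to_datetime(raw, format="%B %d, %Y", errors="coerce")

n_invalid = dates.isna().sum()
earliest = dates.min()   # NaT is skipped by min() and max()
latest = dates.max()

print(n_invalid, earliest, latest)
```

Counting the NaT values after the conversion is a simple way to find out how many rows failed to parse.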


In [33]:
# Save a DataFrame to a CSV file
google_play_store.to_csv('mon_dataframe.csv')
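
By default, `to_csv` also writes the row index as a first, unnamed column. The sketch below shows the difference with `index=False`, using in-memory buffers instead of real files; the small DataFrame is illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Default behaviour: the index is written as a first column without a name
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)

# index=False writes only the data columns
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded.columns), list(reloaded_clean.columns))
```

When the file is read back without further options, the saved index reappears as a column named `Unnamed: 0`.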

In [34]:
# Use the "App" column as the index (set_index returns a new DataFrame; google_play_store itself is unchanged)
google_play_store.set_index("App")

App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
"U Launcher Lite – FREE Live Cool Themes, Hide Apps",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...
Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


In [35]:
# Use the "App" column as the index and store the result in a new variable
google_apps_index = google_play_store.set_index("App")
google_apps_index

App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
"U Launcher Lite – FREE Live Cool Themes, Hide Apps",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...
Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device
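
That `set_index` leaves the original DataFrame untouched, and that `reset_index` reverses the operation, can be verified on a two-row example. The DataFrame `apps` below is hypothetical.

```python
import pandas as pd

apps = pd.DataFrame({"App": ["a1", "a2"], "Rating": [4.1, 3.9]})

# set_index returns a new DataFrame; apps keeps its "App" column
indexed = apps.set_index("App")

# reset_index moves the index back into an ordinary column
restored = indexed.reset_index()

print(indexed.index.name, list(indexed.columns), list(apps.columns))
```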


<font size="6" color="red">>>> Accessing the elements of a DataFrame >>></font>

In [36]:
# Selecting elements by row and column
# iloc: selection by integer position
# loc: selection by label

In [37]:
# Select the element at row position 4 and column position 3 (the 5th row and 4th column, since positions start at 0)
google_apps_index.iloc[4,3] 

'2.8M'

In [38]:
# Select the element at row "Pixel Draw - Number Art Coloring Book" and column "Size"
google_apps_index.loc["Pixel Draw - Number Art Coloring Book", "Size"]

'2.8M'
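
Both calls return the same cell because the label pair and the position pair designate the same location. The correspondence can be looked up with `Index.get_loc`, as in this sketch on a small hypothetical table.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Category": ["ART", "ART", "GAME"],
        "Rating": [4.1, 3.9, 4.7],
        "Size": ["19M", "14M", "8.7M"],
    },
    index=["App A", "App B", "App C"],
)

# Translate labels into integer positions
row_pos = df.index.get_loc("App C")
col_pos = df.columns.get_loc("Size")

by_position = df.iloc[row_pos, col_pos]
by_label = df.loc["App C", "Size"]

print(row_pos, col_pos, by_position, by_label)
```

`get_loc` returns a single integer only when the label is unique; with duplicated labels it returns a slice or a boolean mask instead.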

In [39]:
# Select the row "Udemy - Online Courses" with all its columns
google_apps_index.loc["Udemy - Online Courses", :]

App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
Udemy - Online Courses,EDUCATION,4.5,99020,18M,"1,000,000+",Free,0,Everyone,Education,"August 2, 2018",5.0.4,5.0 and up
Udemy - Online Courses,EDUCATION,4.5,99020,18M,"1,000,000+",Free,0,Everyone,Education,"August 2, 2018",5.0.4,5.0 and up
Udemy - Online Courses,EDUCATION,4.5,99020,18M,"1,000,000+",Free,0,Everyone,Education,"August 2, 2018",5.0.4,5.0 and up


Note: this entry appears three times, so the dataset contains duplicated rows.
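
Duplicated rows such as these can be counted and removed with `duplicated()` and `drop_duplicates()`. The sketch below uses a hypothetical four-row table rather than the real dataset.

```python
import pandas as pd

apps = pd.DataFrame(
    {
        "App": ["Udemy", "Udemy", "Sketch", "Udemy"],
        "Rating": [4.5, 4.5, 4.5, 4.5],
    }
)

# duplicated() flags each row that is identical to an earlier row
n_duplicates = apps.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each distinct row
deduplicated = apps.drop_duplicates()

# value_counts() shows how often each app name occurs
counts = apps["App"].value_counts()

print(n_duplicates, len(deduplicated), counts["Udemy"])
```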

In [40]:
# Display the first 4 rows with iloc
google_apps_index.iloc[:4, :] # row positions 0 to 3, all columns

App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
"U Launcher Lite – FREE Live Cool Themes, Hide Apps",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up


In [41]:
# Display the first 4 columns with iloc
google_apps_index.iloc[:, :4] # all rows, column positions 0 to 3

App,Category,Rating,Reviews,Size
Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M
Coloring book moana,ART_AND_DESIGN,3.9,967,14M
"U Launcher Lite – FREE Live Cool Themes, Hide Apps",ART_AND_DESIGN,4.7,87510,8.7M
Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M
Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M
...,...,...,...,...
Sya9a Maroc - FR,FAMILY,4.5,38,53M
Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M
Parkinson Exercices FR,MEDICAL,,3,9.5M
The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device


In [42]:
# Display the first 5 rows of the previous selection
google_apps_index.iloc[:, :4].head()

App,Category,Rating,Reviews,Size
Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M
Coloring book moana,ART_AND_DESIGN,3.9,967,14M
"U Launcher Lite – FREE Live Cool Themes, Hide Apps",ART_AND_DESIGN,4.7,87510,8.7M
Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M
Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M


In [43]:
# Display the first 5 rows of the columns "Category", "Rating", "Reviews" and "Size"
google_apps_index.loc[:, ["Category", "Rating", "Reviews", "Size"]].head()

App,Category,Rating,Reviews,Size
Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M
Coloring book moana,ART_AND_DESIGN,3.9,967,14M
"U Launcher Lite – FREE Live Cool Themes, Hide Apps",ART_AND_DESIGN,4.7,87510,8.7M
Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M
Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M


In [44]:
# Select the rows at positions 2 to 4 and the columns at positions 3 to 7 (the end of an iloc slice is excluded)
google_apps_index.iloc[2:5, 3:8]

App,Size,Installs,Type,Price,Content Rating
"U Launcher Lite – FREE Live Cool Themes, Hide Apps",8.7M,"5,000,000+",Free,0,Everyone
Sketch - Draw & Paint,25M,"50,000,000+",Free,0,Teen
Pixel Draw - Number Art Coloring Book,2.8M,"100,000+",Free,0,Everyone
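
Slices behave differently with the two indexers: `iloc` excludes the end position, whereas `loc` includes the end label. A minimal sketch on a made-up 3 × 4 table:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
    index=["r0", "r1", "r2"],
    columns=["c0", "c1", "c2", "c3"],
)

# Positions 0-1 for rows and 1-2 for columns: the stop position is excluded
by_position = df.iloc[0:2, 1:3]

# Label slices include both ends
by_label = df.loc["r0":"r1", "c1":"c2"]

print(by_position.equals(by_label))
```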


<font size="6" color="red">>>> Adding/removing columns of a DataFrame >>></font>

In [45]:
# DataFrame axes
# axis = 0: the row axis (the row index runs vertically); an aggregation such as sum(axis=0) runs down the rows and returns one value per column
# axis = 1: the column axis (the column labels run horizontally); an aggregation such as sum(axis=1) runs across the columns and returns one value per row
#
# With drop(), the axis says where the labels to remove are looked up: axis=0 removes rows and axis=1 removes columns.
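
The two meanings of `axis` can be compared side by side on the basic DataFrame from section 2. The `columns=` keyword shown at the end is an equivalent spelling of `axis=1` for `drop()`.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# Aggregations: the axis is the direction along which values are combined
column_totals = df.sum(axis=0)   # one value per column
row_totals = df.sum(axis=1)      # one value per row

# drop(): the axis is where the labels to remove are looked up
without_first_row = df.drop(0, axis=0)
without_c = df.drop("C", axis=1)
same_without_c = df.drop(columns=["C"])

print(column_totals.tolist(), row_totals.tolist())
```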

In [46]:
# Extract the "Category" column as a Series (it is added back as a new column in the next cell)
category = google_apps_index["Category"]
category.head()

App
Photo Editor & Candy Camera & Grid & ScrapBook        ART_AND_DESIGN
Coloring book moana                                   ART_AND_DESIGN
U Launcher Lite – FREE Live Cool Themes, Hide Apps    ART_AND_DESIGN
Sketch - Draw & Paint                                 ART_AND_DESIGN
Pixel Draw - Number Art Coloring Book                 ART_AND_DESIGN
Name: Category, dtype: object

In [47]:
# Add the column "Category_bis" at the end of the DataFrame
google_apps_index["Category_bis"] = category
google_apps_index.head()

App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Category_bis
Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up,ART_AND_DESIGN
Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,ART_AND_DESIGN
"U Launcher Lite – FREE Live Cool Themes, Hide Apps",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up,ART_AND_DESIGN
Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up,ART_AND_DESIGN
Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up,ART_AND_DESIGN


In [48]:
# Drop the last two columns (drop returns a new DataFrame; google_apps_index itself is unchanged)
google_apps_index.drop(["Category_bis", "Android Ver"], axis=1) # axis=1 to drop columns with drop()

App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver
Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0
Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0
"U Launcher Lite – FREE Live Cool Themes, Hide Apps",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4
Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device
Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1
...,...,...,...,...,...,...,...,...,...,...,...
Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48
Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0
Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0
The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device


<font size="6" color="red">>>> Exploring a DataFrame >>></font>

<font size="5" color="red">> We return to the original table for the rest of the analysis. <</font>

In [49]:
# Number of rows and columns of the DataFrame
google_play_store.shape

(10841, 13)

In [50]:
# Column names of the DataFrame
google_play_store.columns 

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')

In [51]:
# Number of missing values in the "Rating" column
google_play_store["Rating"].isna().sum()  # sum() counts the True values

1474

In [52]:
# Number of missing values in each column
google_play_store.isna().sum()

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

In [53]:
# Drop the rows that contain missing values (data cleaning)
google_play_store.dropna() # does not modify the original table

,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10834,FR Calculator,FAMILY,4.0,7,2.6M,500+,Free,0,Everyone,Education,"June 18, 2017",1.0.0,4.1 and up
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


<font size="5" color="red">> New variable 'google_play_store_sans_NaN' <</font>

In [54]:
# Store the table without NaN in a new variable
google_play_store_sans_NaN = google_play_store.dropna()

In [55]:
# Check that no missing values remain
google_play_store_sans_NaN.isna().sum()

App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              0
Price             0
Content Rating    0
Genres            0
Last Updated      0
Current Ver       0
Android Ver       0
dtype: int64

In [56]:
# Check the shape of the DataFrame after dropping the missing values
google_play_store_sans_NaN.shape

(9360, 13)
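
`dropna()` removes a row as soon as any column is missing, which here costs 1481 of the 10841 rows. When only some columns matter, the `subset` parameter restricts the check. The sketch below uses a hypothetical four-row table.

```python
import numpy as np
import pandas as pd

apps = pd.DataFrame(
    {
        "App": ["a", "b", "c", "d"],
        "Rating": [4.1, np.nan, 4.7, 3.9],
        "Current Ver": ["1.0", "2.0", None, "1.1"],
    }
)

# Drop every row that has at least one missing value
fully_complete = apps.dropna()

# Drop only the rows where "Rating" is missing
rated_only = apps.dropna(subset=["Rating"])

print(len(apps), len(fully_complete), len(rated_only))
```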

In [57]:
# Sort the table in ascending order of "Rating"
google_play_store_sans_NaN.sort_values(by="Rating", ascending=True) # the original table is not modified

,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
625,House party - live chat,DATING,1.0,1,9.2M,10+,Free,0,Mature 17+,Dating,"July 31, 2018",3.52,4.0.3 and up
6319,BJ Bridge Standard American 2018,GAME,1.0,1,4.9M,"1,000+",Free,0,Everyone,Card,"May 21, 2018",6.2-sayc,4.0 and up
7926,Tech CU Card Manager,FINANCE,1.0,2,7.2M,"1,000+",Free,0,Everyone,Finance,"July 25, 2017",1.0.1,4.0 and up
7383,Thistletown CI,PRODUCTIVITY,1.0,1,6.6M,100+,Free,0,Everyone,Productivity,"March 15, 2018",41.9,4.1 and up
5978,Truck Driving Test Class 3 BC,FAMILY,1.0,1,2.0M,50+,Paid,$1.49,Everyone,Education,"April 9, 2012",1.0,2.1 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5108,Lakeside AG Moultrie,LIFESTYLE,5.0,3,8.6M,50+,Free,0,Everyone,Lifestyle,"May 23, 2017",1.0,4.1 and up
5064,Tafsiir Quraan MP3 Af Soomaali Quraanka Kariimka,LIFESTYLE,5.0,7,3.4M,"1,000+",Free,0,Everyone,Lifestyle,"June 9, 2018",1.4,4.0 and up
8327,The Divine Feminine App: the DF App,LIFESTYLE,5.0,8,6.7M,"1,000+",Free,0,Everyone,Lifestyle,"May 16, 2016",0.0.4,4.1 and up
5196,AI Today : Artificial Intelligence News & AI 101,NEWS_AND_MAGAZINES,5.0,43,2.3M,100+,Free,0,Everyone,News & Magazines,"June 22, 2018",1.0,4.4 and up
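
Many apps share the same rating, so a second sort key is often useful to break ties. This sketch sorts a made-up table by rating and then by number of reviews, both in descending order.

```python
import pandas as pd

apps = pd.DataFrame(
    {
        "App": ["a", "b", "c"],
        "Rating": [5.0, 4.5, 5.0],
        "Reviews": [3, 100, 43],
    }
)

# Sort by Rating first, then by Reviews to break ties; both descending
ranked = apps.sort_values(by=["Rating", "Reviews"], ascending=[False, False])

print(ranked["App"].tolist())
```

As with a single key, `sort_values` returns a new DataFrame and leaves `apps` in its original order.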


In [58]:
# Descriptive statistics of the DataFrame (numeric columns only)
google_play_store_sans_NaN.describe()

,Rating
count,9360.0
mean,4.191838
std,0.515263
min,1.0
25%,4.0
50%,4.3
75%,4.5
max,5.0


In [59]:
# Data types of the columns
google_play_store_sans_NaN.dtypes

App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

Note: 'Rating' is the only column with a numeric dtype at this point; all the others were read as strings (dtype object).

In [60]:
# Descriptive statistics of all columns
google_play_store_sans_NaN.describe(include='all')

,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
count,9360,9360,9360.0,9360.0,9360,9360,9360,9360.0,9360,9360,9360,9360,9360
unique,8190,33,,5990.0,413,19,2,73.0,6,115,1299,2638,31
top,ROBLOX,FAMILY,,2.0,Varies with device,"1,000,000+",Free,0.0,Everyone,Tools,"August 3, 2018",Varies with device,4.1 and up
freq,9,1746,,83.0,1637,1576,8715,8715.0,7414,732,319,1415,2059
mean,,,4.191838,,,,,,,,,,
std,,,0.515263,,,,,,,,,,
min,,,1.0,,,,,,,,,,
25%,,,4.0,,,,,,,,,,
50%,,,4.3,,,,,,,,,,
75%,,,4.5,,,,,,,,,,


In [61]:
# Convert the "Installs" column to integers
google_play_store_sans_NaN["Installs"] = google_play_store_sans_NaN["Installs"].str.replace("+", "")  # remove the trailing "+"
# google_play_store_sans_NaN["Installs"] = google_play_store_sans_NaN["Installs"].apply(lambda x: x.replace("+", ""))
google_play_store_sans_NaN["Installs"] = google_play_store_sans_NaN["Installs"].str.replace(",", "")  # remove the thousands separators
google_play_store_sans_NaN["Installs"] = google_play_store_sans_NaN["Installs"].astype(int)  # convert to integer
# google_play_store_sans_NaN["Installs"] = google_play_store_sans_NaN["Installs"].apply(lambda x: int(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  google_play_store_sans_NaN["Installs"] = google_play_store_sans_NaN["Installs"].str.replace("+", "")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  google_play_store_sans_NaN["Installs"] = google_play_store_sans_NaN["Installs"].str.replace(",", "")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  goo

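<font color="red">Note: the SettingWithCopyWarning above can be ignored here. pandas raises it because google_play_store_sans_NaN was derived from another DataFrame by dropna(); the assignments still take effect on google_play_store_sans_NaN.</font>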
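
One common way to avoid this warning altogether is to take an explicit copy right after `dropna()`. The sketch below applies that idea to a hypothetical three-column-free table named `raw`; `regex=False` is spelled out because, as far as I know, older pandas versions treated the pattern as a regular expression by default.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame(
    {
        "Installs": ["10,000+", "500,000+", "1,000,000,000+", None],
        "Rating": [4.1, 3.9, 4.7, np.nan],
    }
)

# .copy() makes the result an independent DataFrame,
# so later column assignments do not trigger SettingWithCopyWarning
clean = raw.dropna().copy()

# Chain the string clean-up and the conversion in a single assignment
clean["Installs"] = (
    clean["Installs"]
    .str.replace("+", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype("int64")
)

print(clean["Installs"].tolist())
```

Requesting `int64` explicitly also keeps the result identical across platforms, whereas plain `int` gave `int32` in the run shown in this notebook.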

In [62]:
# Check the data type of the "Installs" column
google_play_store_sans_NaN.dtypes 

App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs            int32
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [63]:
# Descriptive statistics
google_play_store_sans_NaN.describe()

,Rating,Installs
count,9360.0,9360.0
mean,4.191838,17908750.0
std,0.515263,91266370.0
min,1.0,1.0
25%,4.0,10000.0
50%,4.3,500000.0
75%,4.5,5000000.0
max,5.0,1000000000.0


In [64]:
# Convert the "Reviews" column to integers
google_play_store_sans_NaN["Reviews"] = google_play_store_sans_NaN["Reviews"].astype(int) 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  google_play_store_sans_NaN["Reviews"] = google_play_store_sans_NaN["Reviews"].astype(int)


<font color="red">NB: Ignore the warning above</font>

In [65]:
# Display descriptive statistics
google_play_store_sans_NaN.describe()

Unnamed: 0,Rating,Reviews,Installs
count,9360.0,9360.0,9360.0
mean,4.191838,514376.7,17908750.0
std,0.515263,3145023.0,91266370.0
min,1.0,1.0,1.0
25%,4.0,186.75,10000.0
50%,4.3,5955.0,500000.0
75%,4.5,81627.5,5000000.0
max,5.0,78158310.0,1000000000.0


In [66]:
# Display the number of free and paid applications
google_play_store_sans_NaN["Type"].value_counts()

Type
Free    8715
Paid     645
Name: count, dtype: int64

<font size="6" color="red">>>> Filtering a DataFrame by conditions >>></font>

In [67]:
google_play_store_sans_NaN["Rating"] > 4.5 # returns a boolean Series (True where the condition holds, False otherwise)

0        False
1        False
2         True
3        False
4        False
         ...  
10834    False
10836    False
10837     True
10839    False
10840    False
Name: Rating, Length: 9360, dtype: bool

In [68]:
# Select the rows with Rating > 4.5
# Pass the boolean mask to the .loc indexer to keep only the rows where Rating > 4.5
google_play_store_sans_NaN.loc[google_play_store_sans_NaN["Rating"] > 4.5, :] 
# google_play_store_sans_NaN.loc[google_play_store_sans_NaN["Rating"] > 4.5]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,5000000,Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
9,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,3.1M,10000,Free,0,Everyone,Art & Design;Creativity,"July 3, 2018",2.8,4.0.3 and up
13,Mandala Coloring Book,ART_AND_DESIGN,4.6,4326,21M,100000,Free,0,Everyone,Art & Design,"June 26, 2018",1.0.4,4.4 and up
16,Photo Designer - Write your name with shapes,ART_AND_DESIGN,4.7,3632,5.5M,500000,Free,0,Everyone,Art & Design,"July 31, 2018",3.1,4.1 and up
19,ibis Paint X,ART_AND_DESIGN,4.6,224399,31M,10000000,Free,0,Everyone,Art & Design,"July 30, 2018",5.5.4,4.1 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10810,Fr Lupupa Sermons,BUSINESS,4.8,19,21M,100,Free,0,Everyone,Business,"June 12, 2018",1.0,4.4 and up
10820,Fr. Daoud Lamei,FAMILY,5.0,22,8.6M,1000,Free,0,Teen,Education,"June 27, 2018",3.8.0,4.1 and up
10829,Bulgarian French Dictionary Fr,BOOKS_AND_REFERENCE,4.6,603,7.4M,10000,Free,0,Everyone,Books & Reference,"June 19, 2016",2.96,4.1 and up
10833,Chemin (fr),BOOKS_AND_REFERENCE,4.8,44,619k,1000,Free,0,Everyone,Books & Reference,"March 23, 2014",0.8,2.2 and up


In [69]:
# Select the applications in the 'BUSINESS' category
# Pass the boolean mask to the .loc indexer to keep only the rows where Category == "BUSINESS"
google_play_store_sans_NaN.loc[google_play_store_sans_NaN["Category"] == "BUSINESS", :] 

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
187,Visual Voicemail by MetroPCS,BUSINESS,4.1,16129,Varies with device,10000000,Free,0,Everyone,Business,"July 30, 2018",Varies with device,Varies with device
188,Indeed Job Search,BUSINESS,4.3,674730,Varies with device,50000000,Free,0,Everyone,Business,"May 21, 2018",Varies with device,Varies with device
189,Uber Driver,BUSINESS,4.4,1254730,Varies with device,10000000,Free,0,Everyone,Business,"August 3, 2018",Varies with device,Varies with device
190,ADP Mobile Solutions,BUSINESS,4.3,85185,29M,5000000,Free,0,Everyone,Business,"July 17, 2018",3.4.2,5.0 and up
191,Snag - Jobs Hiring Now,BUSINESS,4.3,32584,Varies with device,1000000,Free,0,Everyone,Business,"May 4, 2018",Varies with device,Varies with device
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10624,Employ Florida Mobile,BUSINESS,2.9,97,9.9M,10000,Free,0,Everyone,Business,"June 13, 2018",4.5.9,4.0.3 and up
10659,FN,BUSINESS,5.0,14,3.3M,50,Free,0,Everyone,Business,"February 1, 2018",1.0,4.0 and up
10706,Neon Blue Gaming Wallpaper&Theme fo Lenovo K8 ...,BUSINESS,4.6,7,2.0M,500,Free,0,Everyone,Business,"August 24, 2017",1.0.0,2.3.3 and up
10754,SCM FPS Status,BUSINESS,4.2,123,3.3M,10000,Free,0,Everyone,Business,"April 28, 2018",3.5,2.3 and up


In [70]:
# Select the applications in the 'BUSINESS' category with Rating > 4
google_play_store_sans_NaN.loc[(google_play_store_sans_NaN["Category"] == "BUSINESS") & (google_play_store_sans_NaN["Rating"] > 4), :]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
187,Visual Voicemail by MetroPCS,BUSINESS,4.1,16129,Varies with device,10000000,Free,0,Everyone,Business,"July 30, 2018",Varies with device,Varies with device
188,Indeed Job Search,BUSINESS,4.3,674730,Varies with device,50000000,Free,0,Everyone,Business,"May 21, 2018",Varies with device,Varies with device
189,Uber Driver,BUSINESS,4.4,1254730,Varies with device,10000000,Free,0,Everyone,Business,"August 3, 2018",Varies with device,Varies with device
190,ADP Mobile Solutions,BUSINESS,4.3,85185,29M,5000000,Free,0,Everyone,Business,"July 17, 2018",3.4.2,5.0 and up
191,Snag - Jobs Hiring Now,BUSINESS,4.3,32584,Varies with device,1000000,Free,0,Everyone,Business,"May 4, 2018",Varies with device,Varies with device
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10529,FK CLASSIC FOR YOU,BUSINESS,5.0,1,3.5M,10,Free,0,Everyone,Business,"February 20, 2018",1.1.0,4.0 and up
10659,FN,BUSINESS,5.0,14,3.3M,50,Free,0,Everyone,Business,"February 1, 2018",1.0,4.0 and up
10706,Neon Blue Gaming Wallpaper&Theme fo Lenovo K8 ...,BUSINESS,4.6,7,2.0M,500,Free,0,Everyone,Business,"August 24, 2017",1.0.0,2.3.3 and up
10754,SCM FPS Status,BUSINESS,4.2,123,3.3M,10000,Free,0,Everyone,Business,"April 28, 2018",3.5,2.3 and up


In [71]:
# Select the applications in the 'BUSINESS' category with Rating > 4 and Type == "Free"
google_play_store_sans_NaN.loc[(google_play_store_sans_NaN["Category"] == "BUSINESS") & (google_play_store_sans_NaN["Rating"] > 4) & (google_play_store_sans_NaN["Type"] == "Free"), :]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
187,Visual Voicemail by MetroPCS,BUSINESS,4.1,16129,Varies with device,10000000,Free,0,Everyone,Business,"July 30, 2018",Varies with device,Varies with device
188,Indeed Job Search,BUSINESS,4.3,674730,Varies with device,50000000,Free,0,Everyone,Business,"May 21, 2018",Varies with device,Varies with device
189,Uber Driver,BUSINESS,4.4,1254730,Varies with device,10000000,Free,0,Everyone,Business,"August 3, 2018",Varies with device,Varies with device
190,ADP Mobile Solutions,BUSINESS,4.3,85185,29M,5000000,Free,0,Everyone,Business,"July 17, 2018",3.4.2,5.0 and up
191,Snag - Jobs Hiring Now,BUSINESS,4.3,32584,Varies with device,1000000,Free,0,Everyone,Business,"May 4, 2018",Varies with device,Varies with device
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10529,FK CLASSIC FOR YOU,BUSINESS,5.0,1,3.5M,10,Free,0,Everyone,Business,"February 20, 2018",1.1.0,4.0 and up
10659,FN,BUSINESS,5.0,14,3.3M,50,Free,0,Everyone,Business,"February 1, 2018",1.0,4.0 and up
10706,Neon Blue Gaming Wallpaper&Theme fo Lenovo K8 ...,BUSINESS,4.6,7,2.0M,500,Free,0,Everyone,Business,"August 24, 2017",1.0.0,2.3.3 and up
10754,SCM FPS Status,BUSINESS,4.2,123,3.3M,10000,Free,0,Everyone,Business,"April 28, 2018",3.5,2.3 and up
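
The three-condition filter above becomes hard to read on one line. A minimal sketch of two more readable forms, using a small hypothetical `apps` frame with the same column names: naming each mask before combining them, and the equivalent `DataFrame.query` string.

```python
import pandas as pd

# Hypothetical sample data with the same column names as the notebook
apps = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Category": ["BUSINESS", "BUSINESS", "GAME", "BUSINESS"],
    "Rating": [4.3, 3.9, 4.8, 4.6],
    "Type": ["Free", "Free", "Free", "Paid"],
})

# Name each condition, then combine with & (and), | (or), ~ (not)
is_business = apps["Category"] == "BUSINESS"
is_well_rated = apps["Rating"] > 4
is_free = apps["Type"] == "Free"

with_mask = apps.loc[is_business & is_well_rated & is_free]

# Equivalent selection written as a query string
with_query = apps.query("Category == 'BUSINESS' and Rating > 4 and Type == 'Free'")

print(with_mask)
```

When conditions are written inline, each comparison must be wrapped in parentheses because `&` binds more tightly than `==` and `>`.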


<font size="6" color="red">>>> Grouping a DataFrame by one or more columns (groupby) >>></font>

In [72]:
# Display the unique values of the "Price" column
google_play_store_sans_NaN["Price"].unique() 

array(['0', '$4.99', '$3.99', '$6.99', '$7.99', '$5.99', '$2.99', '$3.49',
       '$1.99', '$9.99', '$7.49', '$0.99', '$9.00', '$5.49', '$10.00',
       '$24.99', '$11.99', '$79.99', '$16.99', '$14.99', '$29.99',
       '$12.99', '$2.49', '$10.99', '$1.50', '$19.99', '$15.99', '$33.99',
       '$39.99', '$3.95', '$4.49', '$1.70', '$8.99', '$1.49', '$3.88',
       '$399.99', '$17.99', '$400.00', '$3.02', '$1.76', '$4.84', '$4.77',
       '$1.61', '$2.50', '$1.59', '$6.49', '$1.29', '$299.99', '$379.99',
       '$37.99', '$18.99', '$389.99', '$8.49', '$1.75', '$14.00', '$2.00',
       '$3.08', '$2.59', '$19.40', '$3.90', '$4.59', '$15.46', '$3.04',
       '$13.99', '$4.29', '$3.28', '$4.60', '$1.00', '$2.95', '$2.90',
       '$1.97', '$2.56', '$1.20'], dtype=object)

In [73]:
google_play_store_sans_NaN.dtypes

App                object
Category           object
Rating            float64
Reviews             int32
Size               object
Installs            int32
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [74]:
# Convert the "Price" column to float: strip the "$" sign, then cast each value
google_play_store_sans_NaN["Price"] = google_play_store_sans_NaN["Price"].apply(lambda x: x.replace("$", ""))
google_play_store_sans_NaN["Price"] = google_play_store_sans_NaN["Price"].apply(lambda x: float(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  google_play_store_sans_NaN["Price"] = google_play_store_sans_NaN["Price"].apply(lambda x: x.replace("$", ""))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  google_play_store_sans_NaN["Price"] = google_play_store_sans_NaN["Price"].apply(lambda x: float(x))


<font color="red">NB: Ignore the warning above</font>

In [75]:
google_play_store_sans_NaN.dtypes

App                object
Category           object
Rating            float64
Reviews             int32
Size               object
Installs            int32
Type               object
Price             float64
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [76]:
google_play_store_sans_NaN["Price"].unique() 

array([  0.  ,   4.99,   3.99,   6.99,   7.99,   5.99,   2.99,   3.49,
         1.99,   9.99,   7.49,   0.99,   9.  ,   5.49,  10.  ,  24.99,
        11.99,  79.99,  16.99,  14.99,  29.99,  12.99,   2.49,  10.99,
         1.5 ,  19.99,  15.99,  33.99,  39.99,   3.95,   4.49,   1.7 ,
         8.99,   1.49,   3.88, 399.99,  17.99, 400.  ,   3.02,   1.76,
         4.84,   4.77,   1.61,   2.5 ,   1.59,   6.49,   1.29, 299.99,
       379.99,  37.99,  18.99, 389.99,   8.49,   1.75,  14.  ,   2.  ,
         3.08,   2.59,  19.4 ,   3.9 ,   4.59,  15.46,   3.04,  13.99,
         4.29,   3.28,   4.6 ,   1.  ,   2.95,   2.9 ,   1.97,   2.56,
         1.2 ])
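
The two `apply` calls in In [74] can also be expressed with vectorized string methods. A minimal sketch on a hypothetical `prices` Series that uses the same string format as the "Price" column:

```python
import pandas as pd

# Hypothetical price strings in the same format as the "Price" column
prices = pd.Series(["0", "$4.99", "$399.99", "$1.20"], name="Price")

# regex=False treats "$" as a literal character,
# not as the regex end-of-string anchor
as_float = prices.str.replace("$", "", regex=False).astype(float)
print(as_float.tolist())
```

`astype(float)` raises an error on any value that is still not numeric, which makes leftover symbols easy to spot.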

In [77]:
# Display statistics per category
google_play_store_sans_NaN.groupby("Category").describe()

Unnamed: 0_level_0,Rating,Rating,Rating,Rating,Rating,Rating,Rating,Rating,Reviews,Reviews,...,Installs,Installs,Price,Price,Price,Price,Price,Price,Price,Price
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
Category,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
ART_AND_DESIGN,61.0,4.377049,0.328326,3.4,4.1,4.4,4.7,5.0,61.0,28103.56,...,500000.0,50000000.0,61.0,0.097869,0.433898,0.0,0.0,0.0,0.0,1.99
AUTO_AND_VEHICLES,73.0,4.190411,0.543692,2.1,4.0,4.3,4.6,4.9,73.0,15940.14,...,500000.0,10000000.0,73.0,0.02726,0.232912,0.0,0.0,0.0,0.0,1.99
BEAUTY,42.0,4.278571,0.362603,3.1,4.0,4.3,4.575,4.9,42.0,9407.929,...,500000.0,10000000.0,42.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
BOOKS_AND_REFERENCE,178.0,4.346067,0.429046,2.7,4.1,4.5,4.6,5.0,178.0,123363.3,...,1000000.0,1000000000.0,178.0,0.134157,0.674247,0.0,0.0,0.0,0.0,4.6
BUSINESS,303.0,4.121452,0.624422,1.0,3.9,4.3,4.5,5.0,303.0,46053.09,...,1000000.0,100000000.0,303.0,0.245512,1.533832,0.0,0.0,0.0,0.0,17.99
COMICS,58.0,4.155172,0.537758,2.8,3.825,4.4,4.5,5.0,58.0,58309.4,...,1000000.0,10000000.0,58.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
COMMUNICATION,328.0,4.158537,0.426192,1.0,4.0,4.3,4.4,5.0,328.0,2486164.0,...,50000000.0,1000000000.0,328.0,0.172835,0.739936,0.0,0.0,0.0,0.0,4.99
DATING,195.0,3.970769,0.63051,1.0,3.7,4.1,4.4,5.0,195.0,37389.94,...,1000000.0,10000000.0,195.0,0.117744,0.855055,0.0,0.0,0.0,0.0,7.99
EDUCATION,155.0,4.389032,0.251894,3.5,4.2,4.4,4.6,4.9,155.0,255451.7,...,5000000.0,100000000.0,155.0,0.115871,0.72774,0.0,0.0,0.0,0.0,5.99
ENTERTAINMENT,149.0,4.126174,0.302556,3.0,3.9,4.2,4.3,4.7,149.0,397168.8,...,10000000.0,1000000000.0,149.0,0.053557,0.475144,0.0,0.0,0.0,0.0,4.99


In [78]:
# Display statistics per category and per genre
google_play_store_sans_NaN.groupby(["Category", "Genres"]).describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,Rating,Rating,Rating,Rating,Rating,Rating,Rating,Rating,Reviews,Reviews,...,Installs,Installs,Price,Price,Price,Price,Price,Price,Price,Price
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
Category,Genres,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2
ART_AND_DESIGN,Art & Design,55.0,4.380000,0.321685,3.4,4.15,4.4,4.700,5.0,55.0,30575.472727,...,500000.0,5.000000e+07,55.0,0.108545,0.456076,0.0,0.0,0.0,0.0,1.99
ART_AND_DESIGN,Art & Design;Creativity,5.0,4.440000,0.397492,3.8,4.30,4.7,4.700,4.7,5.0,6339.800000,...,500000.0,5.000000e+05,5.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.00
ART_AND_DESIGN,Art & Design;Pretend Play,1.0,3.900000,,3.9,3.90,3.9,3.900,3.9,1.0,967.000000,...,500000.0,5.000000e+05,1.0,0.000000,,0.0,0.0,0.0,0.0,0.00
AUTO_AND_VEHICLES,Auto & Vehicles,73.0,4.190411,0.543692,2.1,4.00,4.3,4.600,4.9,73.0,15940.136986,...,500000.0,1.000000e+07,73.0,0.027260,0.232912,0.0,0.0,0.0,0.0,1.99
BEAUTY,Beauty,42.0,4.278571,0.362603,3.1,4.00,4.3,4.575,4.9,42.0,9407.928571,...,500000.0,1.000000e+07,42.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TRAVEL_AND_LOCAL,Travel & Local;Action & Adventure,1.0,4.100000,,4.1,4.10,4.1,4.100,4.1,1.0,890.000000,...,100000.0,1.000000e+05,1.0,0.000000,,0.0,0.0,0.0,0.0,0.00
VIDEO_PLAYERS,Video Players & Editors,158.0,4.063924,0.554566,1.8,3.80,4.2,4.400,4.9,158.0,696840.936709,...,10000000.0,1.000000e+09,158.0,0.066203,0.519357,0.0,0.0,0.0,0.0,5.99
VIDEO_PLAYERS,Video Players & Editors;Creativity,1.0,4.100000,,4.1,4.10,4.1,4.100,4.1,1.0,159622.000000,...,5000000.0,5.000000e+06,1.0,0.000000,,0.0,0.0,0.0,0.0,0.00
VIDEO_PLAYERS,Video Players & Editors;Music & Video,1.0,4.000000,,4.0,4.00,4.0,4.000,4.0,1.0,119202.000000,...,10000000.0,1.000000e+07,1.0,0.000000,,0.0,0.0,0.0,0.0,0.00
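
`describe()` on a grouped frame returns every statistic for every numeric column, which gives very wide tables like the ones above. A minimal sketch of selecting only a few statistics with named aggregation, on a hypothetical `apps` frame (the output column names `n_apps`, `mean_rating` and `max_price` are illustrative):

```python
import pandas as pd

# Hypothetical sample data with the same column names as the notebook
apps = pd.DataFrame({
    "Category": ["BUSINESS", "BUSINESS", "GAME", "GAME", "GAME"],
    "Rating": [4.0, 5.0, 4.5, 3.5, 4.0],
    "Price": [0.0, 17.99, 0.0, 0.0, 2.99],
})

# Each keyword is an output column: (source column, aggregation function)
summary = apps.groupby("Category").agg(
    n_apps=("Rating", "count"),
    mean_rating=("Rating", "mean"),
    max_price=("Price", "max"),
)
print(summary)
```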


## <font color="red">Exercise from capsule 52</font>
**Goal: study the Android market through the Google Play Store**
1. What are the names, categories and genres of the most installed applications on Google Play?
2. What is the highest number of Reviews? What are the name and overall rating of the application concerned?
3. Display columns 2, 5 and 6 and rows 3 to 16 of the DataFrame
4. How many applications are open to people of all ages? (Everyone)
5. Which category has the most applications?
6. Which is the most expensive application sold?

In [79]:
# 1. What are the names, categories and genres of the most installed applications on Google Play?
# First determine the maximum number of installs
max_installs = google_play_store_sans_NaN["Installs"].max()
print("Maximum number of installs: ", max_installs)

# Select the most installed applications and display the "App", "Category" and "Genres" columns
google_play_store_sans_NaN.loc[google_play_store_sans_NaN["Installs"] == max_installs, ["App", "Category", "Genres"]]

Maximum number of installs:  1000000000


Unnamed: 0,App,Category,Genres
152,Google Play Books,BOOKS_AND_REFERENCE,Books & Reference
335,Messenger – Text and Video Chat for Free,COMMUNICATION,Communication
336,WhatsApp Messenger,COMMUNICATION,Communication
338,Google Chrome: Fast & Secure,COMMUNICATION,Communication
340,Gmail,COMMUNICATION,Communication
341,Hangouts,COMMUNICATION,Communication
381,WhatsApp Messenger,COMMUNICATION,Communication
382,Messenger – Text and Video Chat for Free,COMMUNICATION,Communication
386,Hangouts,COMMUNICATION,Communication
391,Skype - free IM & video calls,COMMUNICATION,Communication
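
The result above lists some applications more than once (for example WhatsApp Messenger at labels 336 and 381), which suggests duplicate rows in the dataset. A minimal sketch of removing repeated names with `drop_duplicates`, on a hypothetical `apps` frame:

```python
import pandas as pd

# Hypothetical sample data containing one duplicated application
apps = pd.DataFrame({
    "App": ["WhatsApp Messenger", "Gmail", "WhatsApp Messenger", "Notes"],
    "Category": ["COMMUNICATION", "COMMUNICATION", "COMMUNICATION", "TOOLS"],
    "Installs": [1000000000, 1000000000, 1000000000, 1000],
})

max_installs = apps["Installs"].max()
top = apps.loc[apps["Installs"] == max_installs, ["App", "Category"]]

# Keep only the first row found for each application name
top_unique = top.drop_duplicates(subset="App")
print(top_unique)
```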


In [80]:
# 2. What is the highest number of Reviews? What are the name and overall rating (Rating) of the application concerned?
max_reviews = google_play_store_sans_NaN["Reviews"].max()
print("Maximum number of Reviews: ", max_reviews)
google_play_store_sans_NaN.loc[google_play_store_sans_NaN["Reviews"] == max_reviews, ["App", "Rating"]]

Maximum number of Reviews:  78158306


Unnamed: 0,App,Rating
2544,Facebook,4.1
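
When a single row is expected, `idxmax()` gives the index label of the maximum directly, so the maximum does not have to be computed first. A minimal sketch on a hypothetical `apps` frame; note that `idxmax()` returns only the first label if several rows share the maximum.

```python
import pandas as pd

# Hypothetical sample data with non-contiguous index labels
apps = pd.DataFrame(
    {"App": ["A", "B", "C"], "Rating": [4.5, 4.1, 3.9], "Reviews": [120, 9000, 45]},
    index=[10, 20, 30],
)

# Label of the row holding the largest number of reviews
top_label = apps["Reviews"].idxmax()
print(apps.loc[top_label, ["App", "Rating"]])
```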


In [81]:
# 3. Display columns 2, 5 and 6 and rows 3 to 16 of the DataFrame
google_play_store_sans_NaN.iloc[2:16, [1, 4, 5]] # iloc is 0-based and excludes the stop: row positions 2 to 15, column positions 1, 4 and 5
# Do not confuse row positions with column positions

Unnamed: 0,Category,Size,Installs
2,ART_AND_DESIGN,8.7M,5000000
3,ART_AND_DESIGN,25M,50000000
4,ART_AND_DESIGN,2.8M,100000
5,ART_AND_DESIGN,5.6M,50000
6,ART_AND_DESIGN,19M,50000
7,ART_AND_DESIGN,29M,1000000
8,ART_AND_DESIGN,33M,1000000
9,ART_AND_DESIGN,3.1M,10000
10,ART_AND_DESIGN,28M,1000000
11,ART_AND_DESIGN,12M,1000000
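
`iloc` selects by position, whereas `loc` selects by index label, and the two differ once rows have been dropped and the labels contain gaps. A minimal sketch of the difference on a hypothetical `apps` frame whose labels are 1, 3, 4 and 6:

```python
import pandas as pd

# Hypothetical frame whose labels have gaps, as after dropping rows
apps = pd.DataFrame(
    {"App": ["A", "B", "C", "D"], "Category": ["X", "X", "Y", "Y"]},
    index=[1, 3, 4, 6],
)

by_position = apps.iloc[1:3]  # positions 1 and 2, stop excluded
by_label = apps.loc[1:3]      # labels 1 through 3, stop included

print(by_position["App"].tolist())
print(by_label["App"].tolist())
```

The same slice `1:3` therefore returns different rows depending on the indexer.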


In [82]:
# 4. How many applications are open to people of all ages (Everyone)?
apps_for_everyone = google_play_store_sans_NaN.loc[google_play_store_sans_NaN["Content Rating"] == "Everyone", "Content Rating"]  # selection with a boolean mask
apps_for_everyone.count()

7414

In [83]:
# Second approach
result = google_play_store_sans_NaN["Content Rating"].value_counts() # number of applications for each age category
print("Number of applications per age category: ")
print(result)
print("")
print("Number of applications open to everyone (Everyone): ", result["Everyone"])

Number of applications per age category: 
Content Rating
Everyone           7414
Teen               1084
Mature 17+          461
Everyone 10+        397
Adults only 18+       3
Unrated               1
Name: count, dtype: int64

Number of applications open to everyone (Everyone):  7414
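
A third way to count is to sum the boolean mask itself, since `True` counts as 1. `value_counts(normalize=True)` additionally gives the share of each category. A minimal sketch on a hypothetical `ratings` Series:

```python
import pandas as pd

# Hypothetical sample of "Content Rating" values
ratings = pd.Series(
    ["Everyone", "Teen", "Everyone", "Mature 17+", "Everyone"],
    name="Content Rating",
)

# True counts as 1, so summing the mask counts the matching rows
n_everyone = int((ratings == "Everyone").sum())

# normalize=True returns proportions instead of counts
shares = ratings.value_counts(normalize=True)
print(n_everyone, shares["Everyone"])
```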


In [84]:
# 5. Which category has the most applications?
result = google_play_store_sans_NaN["Category"].value_counts()
print("Total number of applications per category: ")
print(result.head(10))
print("")
print("Category with the most applications: ", result.idxmax()) # idxmax() returns the index label of the maximum value

Total number of applications per category: 
Category
FAMILY           1746
GAME             1097
TOOLS             733
PRODUCTIVITY      351
MEDICAL           350
COMMUNICATION     328
FINANCE           323
SPORTS            319
PHOTOGRAPHY       317
LIFESTYLE         314
Name: count, dtype: int64

Category with the most applications:  FAMILY


In [85]:
# 6. Which is the most expensive application sold?
max_price = google_play_store_sans_NaN["Price"].max()
google_play_store_sans_NaN.loc[google_play_store_sans_NaN["Price"] == max_price]
# google_play_store_sans_NaN.loc[google_play_store_sans_NaN["Price"] == max_price, ["App"]]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
4367,I'm Rich - Trump Edition,LIFESTYLE,3.6,275,7.3M,10000,Paid,400.0,Everyone,Lifestyle,"May 3, 2018",1.0.1,4.1 and up
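
To see the runners-up as well as the maximum, `nlargest` returns the top rows for a column, already sorted in descending order. A minimal sketch on a hypothetical `apps` frame:

```python
import pandas as pd

# Hypothetical sample data with the same column names as the notebook
apps = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Price": [0.0, 399.99, 4.99, 400.0],
})

# The two most expensive applications, highest price first
top2 = apps.nlargest(2, "Price")[["App", "Price"]]
print(top2)
```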
