## Read with Pandas

Now we will look at the Pandas **read_csv()** function. Pandas can read many different file types, not only CSV. Here is a short overview: https://pandas.pydata.org/docs/user_guide/io.html

To get a quick sense of the data we are working with, **head()** can be used to peek at the first 5 rows of the CSV.

In [3]:
import pandas as pd

# Google play data
gp_data = pd.read_csv("googleplaystore.csv") 

gp_data.head()

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
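
As a small self-contained sketch of what **head()** returns, the snippet below reads a made-up CSV from an in-memory string instead of the real file. The sample rows and the names `sample_csv`, `sample`, `first_five` and `first_two` are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical miniature CSV that stands in for googleplaystore.csv.
sample_csv = io.StringIO(
    "App,Category,Rating\n"
    "App A,TOOLS,4.1\n"
    "App B,GAME,3.9\n"
    "App C,GAME,4.7\n"
    "App D,FAMILY,4.5\n"
    "App E,TOOLS,4.3\n"
    "App F,FAMILY,4.0\n"
)

# read_csv() accepts a file path or a file-like object such as StringIO.
sample = pd.read_csv(sample_csv)

first_five = sample.head()   # with no argument, head() returns the first 5 rows
first_two = sample.head(2)   # pass a number to change how many rows come back

print(first_five.shape)      # (number of rows, number of columns)
print(first_two)
```

Passing a number to **head()** is handy when five rows are too many or too few to get a feel for the data.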


## Drop

Sometimes part of the data is not relevant for us to keep. Let's remove the type, price, content rating, last updated and version columns to make the data cleaner and easier to read. We do this using **drop()**. It takes a list of the labels we want to remove and the axis they are located on (**axis=1** means columns).

We also do not want to have to drop them each time, so let's overwrite the variable *gp_data*.

In [4]:
gp_data = gp_data.drop(["Type", "Price", "Content Rating", "Last Updated", "Current Ver", "Android Ver"], axis=1)

gp_data.head()

| | App | Category | Rating | Reviews | Size | Installs | Genres |
|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Art & Design |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Art & Design;Pretend Play |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Art & Design |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Art & Design |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Art & Design;Creativity |
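
To make the role of the axis argument more concrete, here is a minimal sketch on a made-up frame (the data and the names `sample`, `by_axis` and `by_keyword` are hypothetical). It also shows that **drop()** returns a new DataFrame and leaves the original untouched, which is why the result has to be assigned back to a variable.

```python
import pandas as pd

# Hypothetical two-row frame, only for illustration.
sample = pd.DataFrame({
    "App": ["App A", "App B"],
    "Rating": [4.1, 3.9],
    "Type": ["Free", "Paid"],
    "Price": ["0", "$1.99"],
})

# axis=1 tells drop() that the labels in the list are column names.
by_axis = sample.drop(["Type", "Price"], axis=1)

# The columns= keyword expresses the same thing without the axis argument.
by_keyword = sample.drop(columns=["Type", "Price"])

print(by_keyword.columns.tolist())
print(sample.columns.tolist())  # the original frame still has all four columns
```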


## Count

We can also count values in our dataset. For example, we can count how many apps there are in each category.

In [5]:
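# Series.value_counts() is the method form; the top-level pd.value_counts() is deprecated in recent pandas versions
gp_data["Category"].value_counts()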

FAMILY                 1972
GAME                   1144
TOOLS                   843
MEDICAL                 463
BUSINESS                460
PRODUCTIVITY            424
PERSONALIZATION         392
COMMUNICATION           387
SPORTS                  384
LIFESTYLE               382
FINANCE                 366
HEALTH_AND_FITNESS      341
PHOTOGRAPHY             335
SOCIAL                  295
NEWS_AND_MAGAZINES      283
SHOPPING                260
TRAVEL_AND_LOCAL        258
DATING                  234
BOOKS_AND_REFERENCE     231
VIDEO_PLAYERS           175
EDUCATION               156
ENTERTAINMENT           149
MAPS_AND_NAVIGATION     137
FOOD_AND_DRINK          127
HOUSE_AND_HOME           88
LIBRARIES_AND_DEMO       85
AUTO_AND_VEHICLES        85
WEATHER                  82
ART_AND_DESIGN           65
EVENTS                   64
PARENTING                60
COMICS                   60
BEAUTY                   53
1.9                       1
Name: Category, dtype: int64
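
The last entry, `1.9`, does not look like a category name, which most likely means that at least one row in the file has its values in the wrong columns. A minimal sketch of how such rows could be located is shown below on made-up data. The frame `sample` and the rule that a valid category consists only of capital letters and underscores are assumptions, not something the dataset documents.

```python
import pandas as pd

# Hypothetical data with one malformed category value.
sample = pd.DataFrame({
    "App": ["App A", "App B", "App C", "App D"],
    "Category": ["TOOLS", "GAME", "1.9", "GAME"],
})

# Count how often each category occurs.
counts = sample["Category"].value_counts()

# Assumed rule: a valid category is made of capital letters and underscores only.
looks_valid = sample["Category"].str.fullmatch(r"[A-Z_]+")

# Keep the rows that break the rule so they can be inspected or removed.
suspicious = sample[~looks_valid]

print(counts)
print(suspicious)
```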

## Mean

If we want to find the mean of a numeric column, we simply select that column and put **.mean()** at the end.

In [6]:
gp_data["Rating"].mean()

4.193338315362443
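
If a column contains missing values, **.mean()** skips them by default. The sketch below illustrates this on a made-up Series; the values and the names `ratings`, `mean_default` and `mean_strict` are hypothetical.

```python
import pandas as pd

# Hypothetical ratings with one missing value (None becomes NaN).
ratings = pd.Series([4.0, 3.0, None, 5.0], name="Rating")

# By default NaN is ignored: (4.0 + 3.0 + 5.0) / 3
mean_default = ratings.mean()

# With skipna=False a single NaN makes the whole result NaN.
mean_strict = ratings.mean(skipna=False)

print(mean_default)
print(mean_strict)
```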

## Group by

We can also group rows that share the same value in a column by using **groupby()**, and then compute a statistic such as the mean for each group.

In [11]:
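# numeric_only=True restricts the mean to numeric columns; recent pandas versions raise an error on text columns without it
gp_data.groupby(["Category"]).mean(numeric_only=True)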

| Category | Rating |
|---|---|
| 1.9 | 19.0 |
| ART_AND_DESIGN | 4.358065 |
| AUTO_AND_VEHICLES | 4.190411 |
| BEAUTY | 4.278571 |
| BOOKS_AND_REFERENCE | 4.346067 |
| BUSINESS | 4.121452 |
| COMICS | 4.155172 |
| COMMUNICATION | 4.158537 |
| DATING | 3.970769 |
| EDUCATION | 4.389032 |

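
As a minimal sketch of what **groupby()** does, the example below uses a made-up frame (the data and the names `sample`, `mean_by_category` and `summary` are hypothetical). Selecting the column after grouping limits the calculation to that column, and **agg()** can compute several statistics in one go.

```python
import pandas as pd

# Hypothetical data: two categories with a few ratings each.
sample = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS", "TOOLS"],
    "Rating": [4.0, 5.0, 3.0, 4.0, 5.0],
})

# Rows with the same Category form one group; the mean is computed per group.
mean_by_category = sample.groupby("Category")["Rating"].mean()

# agg() takes a list of statistics and returns one column per statistic.
summary = sample.groupby("Category")["Rating"].agg(["mean", "count"])

print(mean_by_category)
print(summary)
```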

### We can also filter out some of the data

**Let's filter out apps with a rating of 4 or above, keeping only those rated below 4.**

In [9]:
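# Keep only rows with a rating below 4.0, then average the numeric columns per category
gp_data[gp_data["Rating"] < 4.0].groupby(["Category"]).mean(numeric_only=True)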

| Category | Rating |
|---|---|
| ART_AND_DESIGN | 3.685714 |
| AUTO_AND_VEHICLES | 3.466667 |
| BEAUTY | 3.742857 |
| BOOKS_AND_REFERENCE | 3.572414 |
| BUSINESS | 3.29375 |
| COMICS | 3.466667 |
| COMMUNICATION | 3.495 |
| DATING | 3.326087 |
| EDUCATION | 3.783333 |
| ENTERTAINMENT | 3.741463 |
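
The filter works in two steps: the comparison produces a Series of True/False values, and indexing the DataFrame with that Series keeps only the rows marked True. Here is a small sketch on made-up data; the names `sample`, `mask`, `low_rated` and `low_mean` are hypothetical.

```python
import pandas as pd

# Hypothetical data, only for illustration.
sample = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS"],
    "Rating": [3.5, 4.5, 3.0, 3.5],
})

# Step 1: the comparison gives one True/False value per row.
mask = sample["Rating"] < 4.0

# Step 2: indexing with the mask keeps only the rows where it is True.
low_rated = sample[mask]

# The filtered frame can be grouped like any other DataFrame.
low_mean = low_rated.groupby("Category")["Rating"].mean()

print(mask.tolist())
print(low_mean)
```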
