# Exercise 1. - Getting and Knowing your Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Check out the [Occupation Exercises Video Tutorial](https://www.youtube.com/watch?v=W8AB5s-L3Rw&list=PLgJhDSE2ZLxaY_DigHeiIDC1cD09rXgJv&index=4) to watch a data scientist go through the exercises.

### Step 1. Import the necessary libraries!

In [None]:
import pandas as pd
import seaborn as sns
from dfply import *

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

### Step 3. Assign it to a variable called users and use the 'user_id' as index

In [None]:
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user'
# The file is pipe-delimited; index_col makes 'user_id' the index while reading.
users = pd.read_csv(url, sep='|', index_col='user_id')
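
The loading step can also be tried without a network connection. The following is a minimal sketch, assuming a small made-up sample in the same pipe-delimited layout (the rows and the name `sample_users` are hypothetical, not taken from the dataset). It reads the text with `sep` and sets the index through `index_col`, an alternative to calling `set_index` afterwards.

```python
import io

import pandas as pd

# Hypothetical sample in a pipe-delimited layout; the values are made up.
sample = io.StringIO(
    "user_id|age|gender|occupation|zip_code\n"
    "1|30|M|engineer|12345\n"
    "2|41|F|writer|54321\n"
)

# index_col sets the index while reading, so no separate set_index call is needed.
sample_users = pd.read_csv(sample, sep="|", index_col="user_id")

print(sample_users.index.name)
print(sample_users.shape)
```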

### Step 4. See the first 25 entries

In [None]:
users >> head(25)

### Step 5. See the last 10 entries

In [None]:
users >> tail(10)

### Step 6. What is the number of observations in the dataset?

In [None]:
# shape is (rows, columns); the first entry is the number of observations.
users.shape[0]

### Step 7. What is the number of columns in the dataset?

In [None]:
print(users.shape[1])

### Step 8. Print the name of all the columns.

In [None]:
users.columns

### Step 9. How is the dataset indexed?

In [None]:
users.index

### Step 10. What is the data type of each column?

In [None]:
# dtypes lists the data type of each column directly.
users.dtypes

### Step 11. Print only the occupation column

In [None]:
users >> select("occupation") >> head(10)


In [None]:
users.occupation.head(10)

### Step 12. How many different occupations are in this dataset?

In [None]:

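# Count the distinct values that remain after dropping duplicates.
unique_occupations = users.occupation.drop_duplicates()
print(unique_occupations.count())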

In [None]:
users.occupation.nunique()

### Step 13. What is the most frequent occupation?

In [None]:
# value_counts sorts by frequency, so the first row is the most frequent occupation.
users.occupation.value_counts().head(1)

### Step 14. Summarize the DataFrame.

In [None]:
# By default describe() summarizes only the numeric columns.
users.describe()

### Step 15. Summarize all the columns

In [None]:
users.describe(include='all')

### Step 16. Summarize only the occupation column

In [None]:
users.occupation.describe()

### Step 17. What is the mean age of users?

In [None]:
users.age.mean()

### Step 18. What is the age with least occurrence?

In [None]:
users.age.value_counts().tail()
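
Looking at the tail of `value_counts()` shows the rarest ages, but several ages can share the lowest count. The following is a minimal sketch on made-up data (the `ages` values are hypothetical) of one way to read off the most frequent value and every value tied for least frequent.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset.
ages = pd.Series([20, 20, 20, 31, 31, 45, 52], name="age")

counts = ages.value_counts()

# Most frequent value: idxmax returns the index label with the largest count.
most_common = counts.idxmax()

# Least frequent values: keep every age whose count equals the minimum,
# so ties are not hidden.
least_common = sorted(counts[counts == counts.min()].index)

print(most_common)
print(least_common)
```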

---
End of Exercise1.ipynb
---

# Exercise 2. - Filtering and Sorting Data

Check out the [Euro 12 Exercises Video Tutorial](https://youtu.be/iqk5d48Qisg) to watch a data scientist go through the exercises.

This time we are going to pull data directly from the internet.

### Step 1. Import the necessary libraries

In [None]:
import pandas as pd
import seaborn as sns
from dfply import *


### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/kflisikowsky/pandas_exercises/refs/heads/main/Euro_2012_stats_TEAM.csv). 

### Step 3. Assign it to a variable called euro12.

In [None]:
url = 'https://raw.githubusercontent.com/kflisikowsky/pandas_exercises/refs/heads/main/Euro_2012_stats_TEAM.csv'
euro12 = pd.read_csv(url)
print(euro12.head())

### Step 4. Select only the Goals column.

In [None]:
euro12 >> select(X.Goals)

### Step 5. How many teams participated in Euro 2012?

In [None]:
print(euro12.Team.count())

### Step 6. What is the number of columns in the dataset?

In [None]:
euro12.shape[1]

### Step 7. View only the columns Team, Yellow Cards and Red Cards and assign them to a dataframe called discipline

In [None]:
discipline = euro12[['Team', 'Yellow Cards', 'Red Cards']]
discipline

### Step 8. Sort the teams by Red Cards, then by Yellow Cards

In [None]:
discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending=False)
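
When sorting by several columns, the first column is the primary key and later columns only break ties. The following is a minimal sketch on made-up data (the `cards` frame and its values are hypothetical) that passes one direction per column through a list for `ascending`.

```python
import pandas as pd

# Hypothetical card counts for illustration only.
cards = pd.DataFrame({
    "Team": ["A", "B", "C", "D"],
    "Yellow Cards": [5, 9, 7, 4],
    "Red Cards": [1, 0, 1, 0],
})

# Red Cards is the primary key; Yellow Cards only orders teams
# that have the same number of Red Cards.
ranked = cards.sort_values(["Red Cards", "Yellow Cards"], ascending=[False, False])

print(ranked["Team"].tolist())
```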

### Step 9. Calculate the mean Yellow Cards given per Team

In [None]:
mean_yellow_cards = euro12['Yellow Cards'].mean()
print(mean_yellow_cards)

### Step 10. Filter teams that scored more than 6 goals

In [None]:
more_than_6 = euro12.query('Goals > 6')
more_than_6

### Step 11. Select the teams that start with G

In [None]:
# Boolean mask built with the string accessor; True where the team name starts with "G".
starts_with_G = euro12[euro12.Team.str.startswith('G')]
starts_with_G
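
Both filters above select rows by a condition. The following is a minimal sketch on made-up data (the team names and goal counts in `toy` are illustrative, not taken from the dataset) showing that `query` and a boolean mask return the same rows, and that string conditions can be written with the `.str` accessor.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "Team": ["Germany", "Greece", "Spain"],
    "Goals": [10, 5, 12],
})

# Two equivalent ways to keep rows with more than 6 goals.
by_query = toy.query("Goals > 6")
by_mask = toy[toy["Goals"] > 6]

# String condition expressed as a boolean mask.
starts_with_g = toy[toy["Team"].str.startswith("G")]

print(by_query["Team"].tolist())
print(starts_with_g["Team"].tolist())
```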

### Step 12. Select the first 7 columns

In [None]:
# iloc takes [rows, columns]; ':' keeps every row, 0:7 keeps the first 7 columns.
euro12.iloc[:, 0:7]

### Step 13. Select all columns except the last 3.

In [None]:
# ':' keeps every row, ':-3' keeps every column except the last 3.
euro12.iloc[:, :-3]
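
Row and column selection are easy to mix up with `iloc`. The following is a minimal sketch on a made-up frame (the name `toy` and its column labels are placeholders) showing that a single slice selects rows, while columns need the second position after a comma.

```python
import pandas as pd

# Hypothetical 4x5 frame; column names are placeholders.
toy = pd.DataFrame(
    [[r * 10 + c for c in range(5)] for r in range(4)],
    columns=["a", "b", "c", "d", "e"],
)

first_two_rows = toy.iloc[0:2]            # a single slice selects rows
first_two_cols = toy.iloc[:, 0:2]         # rows first, then columns
all_but_last_two_cols = toy.iloc[:, :-2]  # negative stop counts from the end

print(first_two_rows.shape)
print(list(first_two_cols.columns))
print(list(all_but_last_two_cols.columns))
```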

### Step 14. Present only the Shooting Accuracy from England, Italy and Russia

In [None]:
# Filter rows with isin instead of changing the index in place, so the cell can be re-run safely.
euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team', 'Shooting Accuracy']]

---
End of Exercise2.ipynb
---

# Exercise 3. - GroupBy

### Introduction:

GroupBy can be summarized as Split-Apply-Combine.

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Check out this [Diagram](http://i.imgur.com/yjNkiwL.png)  

Check out [Alcohol Consumption Exercises Video Tutorial](https://youtu.be/az67CMdmS6s) to watch a data scientist go through the exercises
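
To make Split-Apply-Combine concrete, the following is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical). It performs the three steps by hand and then with a single `groupby` call, so the two results can be compared.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS", "AS"],
    "beer_servings": [200, 300, 10, 20, 60],
})

# Split: one sub-frame per continent.
pieces = {name: group for name, group in toy.groupby("continent")}

# Apply: compute the mean of each piece separately.
means = {name: group["beer_servings"].mean() for name, group in pieces.items()}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(means, name="beer_servings").sort_index()

# groupby performs the same three steps in one call.
direct = toy.groupby("continent")["beer_servings"].mean()

print(manual)
print(direct)
```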


### Step 1. Import the necessary libraries

In [None]:
import pandas as pd
import seaborn as sns
from dfply import *

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv). 

### Step 3. Assign it to a variable called drinks.

In [None]:
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv'
drinks = pd.read_csv(url)
drinks >> head(15)


### Step 4. Which continent drinks more beer on average?

In [None]:
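# The summary column holds a mean, so it is named accordingly.
# ungroup() lets arrange sort the summary as a whole rather than within each group.
(drinks
 >> group_by('continent')
 >> summarize(mean_beer_servings=X.beer_servings.mean())
 >> ungroup()
 >> arrange(desc(X.mean_beer_servings)))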


### Step 5. For each continent print the statistics for wine consumption.

In [None]:


# describe() on the grouped column returns count, mean, std, min, quartiles and max per continent.
drinks.groupby('continent').wine_servings.describe()

### Step 6. Print the mean alcohol consumption per continent for every column

In [None]:
# numeric_only=True skips the text column 'country', which cannot be averaged.
drinks.groupby('continent').mean(numeric_only=True)

### Step 7. Print the median alcohol consumption per continent for every column

In [None]:
# numeric_only=True skips the text column 'country', which has no median.
drinks.groupby('continent').median(numeric_only=True)

### Step 8. Print the mean, min and max values for spirit consumption.
#### This time output a DataFrame

In [None]:
drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])
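
Passing a list of function names to `agg` yields one output column per function. The following is a minimal sketch on made-up data (the `toy` frame and the output column names are hypothetical) of pandas named aggregation, which gives the same kind of DataFrame with explicit column names.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS"],
    "spirit_servings": [100, 200, 0, 50],
})

# Named aggregation: keyword = (column, function) sets the output column name.
summary = toy.groupby("continent").agg(
    spirit_mean=("spirit_servings", "mean"),
    spirit_min=("spirit_servings", "min"),
    spirit_max=("spirit_servings", "max"),
)

print(summary)
```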

---
End of Exercise3.ipynb
---