# Ex2 - Filtering and Sorting Data

This time we are going to load the data directly from the internet.

#### 1. Load the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/02_Filtering_%26_Sorting/Euro12/Euro_2012_stats_TEAM.csv) directly and assign it to a variable called `euro12`.

In [1]:
import pandas as pd
filename = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/02_Filtering_%26_Sorting/Euro12/Euro_2012_stats_TEAM.csv'
euro12 = pd.read_csv(filename)

euro12.head()

,Team,Goals,Shots on target,Shots off target,Shooting Accuracy,% Goals-to-shots,Total shots (inc. Blocked),Hit Woodwork,Penalty goals,Penalties not scored,...,Saves made,Saves-to-shots ratio,Fouls Won,Fouls Conceded,Offsides,Yellow Cards,Red Cards,Subs on,Subs off,Players Used
0,Croatia,4,13,12,51.9%,16.0%,32,0,0,0,...,13,81.3%,41,62,2,9,0,9,9,16
1,Czech Republic,4,13,18,41.9%,12.9%,39,0,0,0,...,9,60.1%,53,73,8,7,0,11,11,19
2,Denmark,4,10,10,50.0%,20.0%,27,1,0,0,...,10,66.7%,25,38,8,4,0,7,7,15
3,England,5,11,18,50.0%,17.2%,40,0,0,0,...,22,88.1%,43,45,6,5,0,11,11,16
4,France,3,22,24,37.9%,6.5%,65,1,0,0,...,6,54.6%,36,51,5,6,0,11,11,19


#### 2. What is the number of columns in the dataset?

In [2]:
num_col = euro12.shape[1]
print(f"The number of columns is: {num_col}")

The number of columns is: 35


#### 3. How many columns have numbers stored as strings?

In [3]:
euro12.info()  # info() prints its report itself and returns None, so no print() is needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Team                        16 non-null     object 
 1   Goals                       16 non-null     int64  
 2   Shots on target             16 non-null     int64  
 3   Shots off target            16 non-null     int64  
 4   Shooting Accuracy           16 non-null     object 
 5   % Goals-to-shots            16 non-null     object 
 6   Total shots (inc. Blocked)  16 non-null     int64  
 7   Hit Woodwork                16 non-null     int64  
 8   Penalty goals               16 non-null     int64  
 9   Penalties not scored        16 non-null     int64  
 10  Headed goals                16 non-null     int64  
 11  Passes                      16 non-null     int64  
 12  Passes completed            16 non-null     int64  
 13  Passing Accuracy            16 non-nu
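
The `info()` report only shows dtypes, so the count still has to be read off by eye. As a minimal sketch, the check can be automated by trying to parse every non-numeric column after stripping a trailing `%`. The `sample` frame is a small stand-in built from the preview rows above, and `numeric_string_columns` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Small stand-in for euro12, built from values shown in the preview above.
sample = pd.DataFrame({
    "Team": ["Croatia", "Czech Republic", "Denmark"],
    "Goals": [4, 4, 4],
    "Shooting Accuracy": ["51.9%", "41.9%", "50.0%"],
    "% Goals-to-shots": ["16.0%", "12.9%", "20.0%"],
})

def numeric_string_columns(df):
    """Return names of text columns whose values all parse as numbers once '%' is stripped."""
    found = []
    for name in df.columns:
        if pd.api.types.is_numeric_dtype(df[name]):
            continue  # already stored as numbers
        stripped = df[name].astype(str).str.rstrip("%")
        parsed = pd.to_numeric(stripped, errors="coerce")  # unparseable text becomes NaN
        if parsed.notna().all():
            found.append(name)
    return found

numeric_cols = numeric_string_columns(sample)
print(len(numeric_cols), numeric_cols)
```

Calling `numeric_string_columns(euro12)` on the full dataset should give the answer to this step, assuming percentages are the only non-numeric formatting in the file.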

#### 4. Standardize column names: convert to lowercase, remove special characters and replace spaces and dashes with underscores

In [4]:
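euro12.columns = euro12.columns.str.lower()
# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default
euro12.columns = euro12.columns.str.replace(r'[#@&()%.,]', "", regex=True)
euro12.columns = euro12.columns.str.replace(r'[ -]', "_", regex=True)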
euro12.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   team                     16 non-null     object 
 1   goals                    16 non-null     int64  
 2   shots_on_target          16 non-null     int64  
 3   shots_off_target         16 non-null     int64  
 4   shooting_accuracy        16 non-null     object 
 5   _goals_to_shots          16 non-null     object 
 6   total_shots_inc_blocked  16 non-null     int64  
 7   hit_woodwork             16 non-null     int64  
 8   penalty_goals            16 non-null     int64  
 9   penalties_not_scored     16 non-null     int64  
 10  headed_goals             16 non-null     int64  
 11  passes                   16 non-null     int64  
 12  passes_completed         16 non-null     int64  
 13  passing_accuracy         16 non-null     object 
 14  touches                  16 
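
If the same cleanup is needed for several files, the three steps can be wrapped in one function. This is only a sketch: `standardize_columns` is a hypothetical helper name, and the empty `sample` frame just carries a few of the original column names for illustration.

```python
import pandas as pd

def standardize_columns(df):
    """Return a copy of df with lowercase, underscore-separated column names."""
    cleaned = (
        df.columns.str.lower()
        .str.replace(r"[#@&()%.,]", "", regex=True)  # drop special characters
        .str.replace(r"[ -]", "_", regex=True)       # spaces and dashes become "_"
    )
    out = df.copy()
    out.columns = cleaned
    return out

sample = pd.DataFrame(columns=["Team", "% Goals-to-shots", "Total shots (inc. Blocked)"])
clean_names = list(standardize_columns(sample).columns)
print(clean_names)
```

Returning a copy leaves the input frame untouched, which makes the helper safe to call on data that is still needed under its original names.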


#### 5. Keep only `Team`, `Goals`, `Shooting Accuracy`, `Offsides`, `Red Cards` and `Yellow Cards` features

In [5]:
features = ['team', 'goals', 'shooting_accuracy', 'offsides', 'red_cards', 'yellow_cards']
subset = euro12[features]

subset.head()

,team,goals,shooting_accuracy,offsides,red_cards,yellow_cards
0,Croatia,4,51.9%,2,0,9
1,Czech Republic,4,41.9%,8,0,7
2,Denmark,4,50.0%,8,0,4
3,England,5,50.0%,6,0,5
4,France,3,37.9%,5,0,6


#### 6. Select only the columns from the previous task at load time

In [6]:
subset_load = pd.read_csv(filename, usecols = ['Team', 'Goals', 'Shooting Accuracy', 'Offsides', 'Red Cards', 'Yellow Cards'])
subset_load.head()

,Team,Goals,Shooting Accuracy,Offsides,Yellow Cards,Red Cards
0,Croatia,4,51.9%,2,9,0
1,Czech Republic,4,41.9%,8,7,0
2,Denmark,4,50.0%,8,4,0
3,England,5,50.0%,6,5,0
4,France,3,37.9%,5,6,0
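
Note that the output above lists `Yellow Cards` before `Red Cards` even though `usecols` named them the other way round: `read_csv` keeps the column order of the file and ignores the order of the list. The sketch below shows this on a tiny in-memory CSV (the `csv_text` string is made up from the rows above) and reorders the result by indexing with the same list.

```python
import io
import pandas as pd

# In-memory stand-in for the remote file, using values from the rows above.
csv_text = "Team,Goals,Offsides,Yellow Cards,Red Cards\nCroatia,4,2,9,0\nDenmark,4,8,4,0\n"
wanted = ["Team", "Red Cards", "Yellow Cards"]

loaded = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(list(loaded.columns))     # follows the file order, not the order of `wanted`

reordered = loaded[wanted]      # index with the list to impose the requested order
print(list(reordered.columns))
```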


#### 7. Convert `Shooting Accuracy` to float

In [7]:
euro12['shooting_accuracy'] = euro12['shooting_accuracy'].str.replace('%', '')
euro12['shooting_accuracy'] = euro12['shooting_accuracy'].astype('float')
euro12.head()

,team,goals,shots_on_target,shots_off_target,shooting_accuracy,_goals_to_shots,total_shots_inc_blocked,hit_woodwork,penalty_goals,penalties_not_scored,...,saves_made,saves_to_shots_ratio,fouls_won,fouls_conceded,offsides,yellow_cards,red_cards,subs_on,subs_off,players_used
0,Croatia,4,13,12,51.9,16.0%,32,0,0,0,...,13,81.3%,41,62,2,9,0,9,9,16
1,Czech Republic,4,13,18,41.9,12.9%,39,0,0,0,...,9,60.1%,53,73,8,7,0,11,11,19
2,Denmark,4,10,10,50.0,20.0%,27,1,0,0,...,10,66.7%,25,38,8,4,0,7,7,15
3,England,5,11,18,50.0,17.2%,40,0,0,0,...,22,88.1%,43,45,6,5,0,11,11,16
4,France,3,22,24,37.9,6.5%,65,1,0,0,...,6,54.6%,36,51,5,6,0,11,11,19
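
Other columns, such as `saves_to_shots_ratio`, still hold percentage strings after this step. A possible extension, sketched here on a small stand-in frame, loops over a list of such columns. The `percent_cols` list is an assumption for illustration and should be adjusted to the columns that actually need converting.

```python
import pandas as pd

# Stand-in frame with values taken from the rows above.
sample = pd.DataFrame({
    "team": ["Croatia", "Czech Republic"],
    "shooting_accuracy": ["51.9%", "41.9%"],
    "saves_to_shots_ratio": ["81.3%", "60.1%"],
})

percent_cols = ["shooting_accuracy", "saves_to_shots_ratio"]  # assumed list of percentage columns
for col in percent_cols:
    # rstrip only removes a trailing "%", then the remaining text is parsed as float
    sample[col] = sample[col].str.rstrip("%").astype(float)

print(sample.dtypes)
```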


#### 8. Perform the same conversion at load time

In [8]:
# The converter already returns floats; a dtype for the same column is ignored and triggers a ParserWarning
converted = pd.read_csv(filename, converters = {'Shooting Accuracy': lambda x: float(x.replace('%', ''))})
converted.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Team                        16 non-null     object 
 1   Goals                       16 non-null     int64  
 2   Shots on target             16 non-null     int64  
 3   Shots off target            16 non-null     int64  
 4   Shooting Accuracy           16 non-null     float64
 5   % Goals-to-shots            16 non-null     object 
 6   Total shots (inc. Blocked)  16 non-null     int64  
 7   Hit Woodwork                16 non-null     int64  
 8   Penalty goals               16 non-null     int64  
 9   Penalties not scored        16 non-null     int64  
 10  Headed goals                16 non-null     int64  
 11  Passes                      16 non-null     int64  
 12  Passes completed            16 non-null     int64  
 13  Passing Accuracy            16 non-nu
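
The same converter can serve several percentage columns when it is a named function. The following is a sketch on an in-memory CSV rather than the real URL: `percent_to_float` is a hypothetical helper, and `csv_text` reuses values from the rows shown earlier.

```python
import io
import pandas as pd

def percent_to_float(text):
    """Turn a string such as '51.9%' into the float 51.9."""
    return float(text.rstrip("%"))

# In-memory stand-in for the remote file.
csv_text = "Team,Shooting Accuracy,% Goals-to-shots\nCroatia,51.9%,16.0%\nDenmark,50.0%,20.0%\n"
percent_cols = ["Shooting Accuracy", "% Goals-to-shots"]

converted_sample = pd.read_csv(
    io.StringIO(csv_text),
    converters={col: percent_to_float for col in percent_cols},  # one converter per column
)
print(converted_sample.dtypes)
```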

#### 9. Take only the columns `Team`, `Yellow Cards` and `Red Cards` and assign them to a DataFrame called `discipline`
Note: Make sure that changes in the DataFrame `discipline` don't affect the original DataFrame!

In [9]:
cols = ['team', 'yellow_cards', 'red_cards']
discipline = euro12[cols].copy()  # explicit copy, so edits to discipline never touch euro12

discipline.head()

,team,yellow_cards,red_cards
0,Croatia,9,0
1,Czech Republic,7,0
2,Denmark,4,0
3,England,5,0
4,France,6,0
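
To check the requirement in the note, one can modify the selected frame and confirm that the source is unchanged. This is a minimal sketch on a two-row stand-in; the names `original` and `independent` are illustrative only.

```python
import pandas as pd

original = pd.DataFrame({
    "team": ["Croatia", "Greece"],
    "yellow_cards": [9, 9],
    "red_cards": [0, 1],
})

# .copy() gives the selection its own data, independent of `original`
independent = original[["team", "red_cards"]].copy()
independent.loc[0, "red_cards"] = 99

print(original.loc[0, "red_cards"])     # still the original value
print(independent.loc[0, "red_cards"])  # the modified value
```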


#### 10. Sort the teams in `discipline` by `Red Cards`, then by `Yellow Cards`, in descending order

In [10]:
discipline = discipline.sort_values(['red_cards', 'yellow_cards'], ascending = False)

discipline.head()

,team,yellow_cards,red_cards
6,Greece,9,1
9,Poland,7,1
11,Republic of Ireland,6,1
7,Italy,16,0
10,Portugal,12,0
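
`ascending` also accepts one flag per sort key, which helps when the two columns should run in different directions. The sketch below is a variation on the task, not part of it: it sorts a four-row stand-in by red cards descending and yellow cards ascending.

```python
import pandas as pd

# Stand-in frame with values from the output above.
cards = pd.DataFrame({
    "team": ["Greece", "Poland", "Italy", "Portugal"],
    "yellow_cards": [9, 7, 16, 12],
    "red_cards": [1, 1, 0, 0],
})

# One flag per key: red_cards descending, yellow_cards ascending within ties
mixed = cards.sort_values(["red_cards", "yellow_cards"], ascending=[False, True])
print(mixed)
```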


#### 11. Filter teams that scored more than 6 goals

In [11]:
high_scor_teams = euro12[euro12['goals'] > 6]

high_scor_teams.head()

,team,goals,shots_on_target,shots_off_target,shooting_accuracy,_goals_to_shots,total_shots_inc_blocked,hit_woodwork,penalty_goals,penalties_not_scored,...,saves_made,saves_to_shots_ratio,fouls_won,fouls_conceded,offsides,yellow_cards,red_cards,subs_on,subs_off,players_used
5,Germany,10,32,32,47.8,15.6%,80,2,1,0,...,10,62.6%,63,49,12,4,0,15,15,17
13,Spain,12,42,33,55.9,16.0%,100,0,1,0,...,15,93.8%,102,83,19,11,0,17,17,18
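
Boolean masks can be combined with `&` and `|`, as long as each condition is wrapped in parentheses. As a sketch that goes slightly beyond the task, the example below adds a second condition on `shooting_accuracy` and shows the equivalent `query()` call. The `stats` frame is a stand-in with values from the outputs above.

```python
import pandas as pd

stats = pd.DataFrame({
    "team": ["Croatia", "England", "Germany", "Spain"],
    "goals": [4, 5, 10, 12],
    "shooting_accuracy": [51.9, 50.0, 47.8, 55.9],
})

# Parentheses are required because & binds more tightly than the comparisons
mask = (stats["goals"] > 6) & (stats["shooting_accuracy"] > 50)
accurate_scorers = stats[mask]

# query() expresses the same filter as a string
same_result = stats.query("goals > 6 and shooting_accuracy > 50")
print(accurate_scorers)
```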


#### 12. Show only teams whose names start with "G"

In [12]:
g_teams = euro12[euro12['team'].str.startswith('G')]

g_teams

,team,goals,shots_on_target,shots_off_target,shooting_accuracy,_goals_to_shots,total_shots_inc_blocked,hit_woodwork,penalty_goals,penalties_not_scored,...,saves_made,saves_to_shots_ratio,fouls_won,fouls_conceded,offsides,yellow_cards,red_cards,subs_on,subs_off,players_used
5,Germany,10,32,32,47.8,15.6%,80,2,1,0,...,10,62.6%,63,49,12,4,0,15,15,17
6,Greece,5,8,18,30.7,19.2%,32,1,1,1,...,13,65.1%,67,48,12,9,1,12,12,20
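
This dataset has no missing team names, but with messier data a missing value would make the mask itself contain a missing entry, which cannot be used for filtering. A hedged sketch of the usual workaround passes `na=False` so that missing names count as non-matches. The `teams` frame, including its missing entry, is invented for illustration.

```python
import pandas as pd

# Illustrative frame with one missing name (not present in the real data).
teams = pd.DataFrame({"team": ["Germany", "Greece", None, "France"]})

# na=False treats missing names as "does not start with G"
starts_with_g = teams["team"].str.startswith("G", na=False)
g_only = teams[starts_with_g]
print(g_only)
```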
