
# Working with Strings
---

The pandas library provides a set of string functions similar to those built into Python.  They are reached through the `.str` accessor and apply an operation to every value in a Series at once, which is useful because we often want to transform a whole column of values in a dataframe.



### Creating a series from a pandas column
---

A column of data is treated by pandas as a Series.  There is a set of functions that you can access for working with a **series** (just one column from the data table).

To get a Series from a dataframe, select the column by name and store the result in a new variable.  A Series behaves much like a list, except that each value also carries an index label.

The examples below show what you can do with a Series rather than a whole table.

To get a column of data as a series:

```
series_data = df['date']
```
or
```
price_series = df['price']
```
where 'date' or 'price' are the names of the columns in the dataframe
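As a minimal sketch of the idea (the dataframe below is made up for illustration and is not one of the course data sets), selecting a single column gives back a pandas Series that keeps the column name:

```python
import pandas as pd

# Hypothetical two-column dataframe, for illustration only
toy_df = pd.DataFrame({
    "date": ["1999-12-01", "2000-12-01"],
    "price": [100, 120],
})

# Selecting one column by name returns a Series
date_series = toy_df["date"]

print(type(date_series))   # the pandas Series class
print(date_series.name)    # the column name is kept as the Series name
```

The Series shares the dataframe's index, so its values still line up with the original rows.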

### Splitting data 
---

Series.str.split() *to split a column's strings into components*    
Series.str.get() *to get one of the components after the split*  

You can **daisychain** these together:   

`Series.str.split().str.get()`

* `split()` will split by white space unless specified, for example if you wanted to split by "/" you would use `split("/")`  

* `get()` requires the index position of the component you would like to 'get'. Positions start at 0, so if you want the first component (e.g. 1999 from 1999-12-01), use `get(0)`.


*Hint: remember to save your result into a new column* 
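The daisychain can be sketched on a small made-up dataframe (the values and the `toy` name are assumptions for illustration only). Each call to `.str.get()` picks one component out of the lists produced by `.str.split()`:

```python
import pandas as pd

# Hypothetical dates in yyyy-mm-dd format
toy = pd.DataFrame({"date": ["1999-12-01", "2005-03-17"]})

# split() turns each string into a list of components,
# get(n) then picks component n from every list
toy["year"] = toy["date"].str.split("-").str.get(0)
toy["month"] = toy["date"].str.split("-").str.get(1)

print(toy)
```

The results are saved into new columns, so the original `date` column is left untouched.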

### Exercise 1 strings
---

Let's use the data set 'Housing in London' at 'https://raw.githubusercontent.com/futureCodersSE/working-with-data/main/Data%20sets/housing_in_london_yearly_variables.csv'

The date in this dataset is a string.   To filter for a particular year, we will need to extract the first four characters as a substring.  We can create a new column called **year**, which contains just the year (at this stage it is still stored as a string).

The date is written in the format yyyy-mm-dd.  We can split the date around the '-' and then use the first component.


Create a function called **get_year()** which splits the data from the date column, and creates a year column with just the year before returning the year column.


**Test output**:  

```
0       1999
1       1999
2       1999
3       1999
4       1999
        ... 
1066    2019
1067    2019
1068    2019
1069    2019
1070    2019
Name: year, Length: 1071, dtype: object
```

In [None]:
import pandas as pd

url =  'https://raw.githubusercontent.com/futureCodersSE/working-with-data/main/Data%20sets/housing_in_london_yearly_variables.csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,no_of_houses,borough_flag
0,E09000001,city of london,1999-12-01,33020.0,,48922,0,6581.0,,,,1
1,E09000002,barking and dagenham,1999-12-01,21480.0,,23620,3,162444.0,,,,1
2,E09000003,barnet,1999-12-01,19568.0,,23128,8,313469.0,,,,1
3,E09000004,bexley,1999-12-01,18621.0,,21386,18,217458.0,,,,1
4,E09000005,brent,1999-12-01,18532.0,,20911,6,260317.0,,,,1


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1071 entries, 0 to 1070
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   code               1071 non-null   object 
 1   area               1071 non-null   object 
 2   date               1071 non-null   object 
 3   median_salary      1049 non-null   float64
 4   life_satisfaction  352 non-null    float64
 5   mean_salary        1071 non-null   object 
 6   recycling_pct      860 non-null    object 
 7   population_size    1018 non-null   float64
 8   number_of_jobs     931 non-null    float64
 9   area_size          666 non-null    float64
 10  no_of_houses       666 non-null    float64
 11  borough_flag       1071 non-null   int64  
dtypes: float64(6), int64(1), object(5)
memory usage: 100.5+ KB


In [None]:


df_date = df['date']
df_date = df_date.str.split('-')
year = df_date.str.get(0)
year

0       1999
1       1999
2       1999
3       1999
4       1999
        ... 
1066    2019
1067    2019
1068    2019
1069    2019
1070    2019
Name: date, Length: 1071, dtype: object

In [None]:
def get_year():
  # add code below to return a new series created from the year column, with just years 
  df_new = df['date'].str.split('-')
  year = df_new.str.get(0)
  
  return year



# run and test if returned series is of correct length and has correct first row 

actual_len = len(get_year())
actual_value = get_year().iloc[0]
expected_len = 1071
expected_val = "1999"

if actual_len == expected_len and actual_value == expected_val:
  print("Test passed expected length 1071 and first value 1999 and got", actual_len, actual_value)
else: 
  print("Test failed expected length 1071 and first value 1999 and got length", actual_len, "value", actual_value)




Test passed expected length 1071 and first value 1999 and got 1071 1999


### Exercise 2
---

In Exercise 1 you extracted the year, but its dtype is 'object' (it is still a string).  You can convert it to an integer by adding `.astype(int)` to the daisychain.

Create a new function called **get_int_year()**, `return` the year column with values of type int. 
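A small sketch of why the conversion matters, using made-up dates rather than the housing data: once the year is an integer, numeric comparisons such as "after 2000" work as expected.

```python
import pandas as pd

# Hypothetical dates, for illustration only
toy = pd.DataFrame({"date": ["1999-12-01", "2005-03-17", "2010-07-30"]})

# Extract the year as text, then convert it to an integer type
year_int = toy["date"].str.split("-").str.get(0).astype(int)

# Numeric comparison is now possible
after_2000 = toy[year_int > 2000]
print(after_2000)
```

The exact integer width reported by `dtype` can depend on the platform and pandas version, so the sketch does not rely on it.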

**Test output**:  

```
...
Name: year, Length: 1071, dtype: int64
```



In [None]:
def get_int_year():
  # add code below to return a year column where values are of integer type
  year = df['date'].str.split('-').str.get(0)
  year = year.astype(int)
  return year





# run and test if your returned series is of type int

import numpy as np

actual = get_int_year().dtype
expected = np.int64

if actual == expected:
  print("Test passed", actual)
else:
  print("Test failed, expected", expected, "got", actual)

Test passed int64


### Exercise 3
---

All the areas in the data set are in lower case.  To prepare the data for reporting, you may want to capitalise.  Use .str.title() to do this.

Create the function **get_title_areas()** to do this. `Return` the newly capitalised area column 
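For comparison, here is a short sketch on two area names written out by hand: `.str.title()` capitalises every word, whereas `.str.capitalize()` only capitalises the first letter of the whole string.

```python
import pandas as pd

# Two lower-case area names, typed in for illustration
areas = pd.Series(["city of london", "barking and dagenham"])

title_case = areas.str.title()        # every word capitalised
first_only = areas.str.capitalize()   # only the first letter capitalised

print(title_case.tolist())
print(first_only.tolist())
```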

In [None]:
df.head()

Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,no_of_houses,borough_flag
0,E09000001,city of london,1999-12-01,33020.0,,48922,0,6581.0,,,,1
1,E09000002,barking and dagenham,1999-12-01,21480.0,,23620,3,162444.0,,,,1
2,E09000003,barnet,1999-12-01,19568.0,,23128,8,313469.0,,,,1
3,E09000004,bexley,1999-12-01,18621.0,,21386,18,217458.0,,,,1
4,E09000005,brent,1999-12-01,18532.0,,20911,6,260317.0,,,,1


In [None]:
def get_title_areas():
  # add code below to capitalise the first letter of each string in the column 'area'

  area = df['area']
  area = area.str.title()
 
  return area

get_title_areas()



# run and test if the first row of the area column is now correct 

actual = get_title_areas().iloc[0]
expected = "City Of London"

if actual == expected:
  print("Test passed", actual)
else:
  print("Test failed, expected", expected, "got", actual)

Test passed City Of London


### Exercise 4 - Filter all areas to find all with 'And' in the name
---

Create a function called **get_and()** which uses `str.contains()` and a search (e.g. df[df['area'].str.contains()]) to filter and `return` all areas with 'And' in the name  (Note:  case is important)

**Test output**:  
105 rows × 13 columns
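The note about case can be illustrated with a sketch on three hand-typed names (an assumption for illustration, not the full data set). `str.contains()` is case-sensitive by default, and it matches anywhere in the string, including inside longer words:

```python
import pandas as pd

# Hypothetical lower-case area names
areas = pd.Series(["barking and dagenham", "wandsworth", "barnet"])

# Lower-case search also matches 'and' inside 'wandsworth'
lower_hits = areas.str.contains("and")

# Searching the raw lower-case names for 'And' finds nothing
raw_title_hits = areas.str.contains("And")

# Title-casing first means only the separate word 'And' matches here
title_hits = areas.str.title().str.contains("And")

print(lower_hits.tolist(), raw_title_hits.tolist(), title_hits.tolist())
```

Passing the boolean result inside square brackets, as in `df[mask]`, is what turns it into a filtered set of rows.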


In [None]:
df.head()

Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,no_of_houses,borough_flag
0,E09000001,city of london,1999-12-01,33020.0,,48922,0,6581.0,,,,1
1,E09000002,barking and dagenham,1999-12-01,21480.0,,23620,3,162444.0,,,,1
2,E09000003,barnet,1999-12-01,19568.0,,23128,8,313469.0,,,,1
3,E09000004,bexley,1999-12-01,18621.0,,21386,18,217458.0,,,,1
4,E09000005,brent,1999-12-01,18532.0,,20911,6,260317.0,,,,1


In [None]:
df_area = df['area'].str.contains('and')
df_and = df_area[df_area == True]
df_and.count()
 

231

In [None]:
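def get_and():
  # add code to return just rows in which area contains 'And'
  # The raw names are lower case and str.contains() is case-sensitive, so title-case them first.
  # Searching for lower-case 'and' also matches 'and' inside longer words (231 matches in the cell above).
  areas = df['area'].str.title()
  # use the boolean mask to select whole rows of the dataframe, not just the True/False values
  df_and = df[areas.str.contains('And')]

  return df_and



# run and test if returned is correct length 

actual = len(get_and())
expected = 105

if actual == expected:
  print("Test passed", actual)
else:
  print("Test failed, expected", expected, "got", actual)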


### Exercise 5
---

Filter the data for all areas starting with 'Ba'  

*hint: use `startswith()`*

**Test output:**  
42 rows, first row has area 'Barking and Dagenham'
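A brief sketch of the difference between the boolean mask and the filtered rows, using a made-up four-row dataframe: `startswith()` only produces True/False values, and indexing the dataframe with that mask returns the matching rows.

```python
import pandas as pd

# Hypothetical dataframe with lower-case area names
toy = pd.DataFrame({
    "area": ["barking and dagenham", "barnet", "bexley", "brent"],
    "code": ["E09000002", "E09000003", "E09000004", "E09000005"],
})

# One True/False value per row
mask = toy["area"].str.startswith("ba")

# Use the mask to keep only the matching rows, with all their columns
rows = toy[mask]
print(rows)
```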

In [None]:
area = df['area']
ba = area.str.startswith('ba')
ba.value_counts()

starts_ba = ba[ba ==True] 
starts_ba.value_counts()

True    42
Name: area, dtype: int64

In [None]:
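def get_ba():
  # add code to filter for all areas starting with 'Ba' 
  # the raw names are lower case, so search for 'ba'
  area = df['area']
  ba = area.str.startswith('ba')
  # use the boolean mask to select the matching rows rather than returning the mask itself
  starts_ba = df[ba]
  return starts_ba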



# run and test if your returned rows are the right length 

actual = len(get_ba())
expected = 42 

if actual == expected:
  print("Test passed", actual)
else:
  print("Test failed, expected", expected, "got", actual)


Test passed 42


### Exercise 6
---
Create function called **get_ham()** to filter and `return` the data for all areas ending with 'ham', for the year 2000

*hint: use `endswith()`*   

**Test output**:  
4 rows (barking and dagenham, hammersmith and fulham, lewisham, newham)  
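Combining a string test with a year test can be sketched as follows, on made-up rows. Each condition is a boolean Series, and the two are joined with `&`; when the conditions are written inline, each one needs its own parentheses.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "area": ["newham", "newham", "barnet", "lewisham"],
    "date": ["1999-12-01", "2000-12-01", "2000-12-01", "2000-12-01"],
})

ends_ham = toy["area"].str.endswith("ham")
in_2000 = toy["date"].str.split("-").str.get(0) == "2000"

# Keep rows where both conditions are True
result = toy[ends_ham & in_2000]
print(result)
```

Storing each condition in a named variable first keeps the final filter short and easier to read.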

In [None]:
df.head()

Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,no_of_houses,borough_flag
0,E09000001,city of london,1999-12-01,33020.0,,48922,0,6581.0,,,,1
1,E09000002,barking and dagenham,1999-12-01,21480.0,,23620,3,162444.0,,,,1
2,E09000003,barnet,1999-12-01,19568.0,,23128,8,313469.0,,,,1
3,E09000004,bexley,1999-12-01,18621.0,,21386,18,217458.0,,,,1
4,E09000005,brent,1999-12-01,18532.0,,20911,6,260317.0,,,,1


In [None]:
df_new = df['date'].str.split('-').str.get(0) 
df_new = df_new[df_new == '2000']

df_new

51     2000
52     2000
53     2000
54     2000
55     2000
56     2000
57     2000
58     2000
59     2000
60     2000
61     2000
62     2000
63     2000
64     2000
65     2000
66     2000
67     2000
68     2000
69     2000
70     2000
71     2000
72     2000
73     2000
74     2000
75     2000
76     2000
77     2000
78     2000
79     2000
80     2000
81     2000
82     2000
83     2000
84     2000
85     2000
86     2000
87     2000
88     2000
89     2000
90     2000
91     2000
92     2000
93     2000
94     2000
95     2000
96     2000
97     2000
98     2000
99     2000
100    2000
101    2000
Name: date, dtype: object

In [None]:
ham = df['area'].str.endswith('ham') 
get_ham = ham[ham == True]
get_ham.value_counts()

True    84
Name: area, dtype: int64

In [None]:
df_new = df[(df['area'].str.endswith('ham') == True) & (df['date'].str.split('-').str.get(0) == '2000') ]
df_new

Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,no_of_houses,borough_flag
52,E09000002,barking and dagenham,2000-12-01,22618.0,,24696,4,163893.0,57000.0,,,1
63,E09000013,hammersmith and fulham,2000-12-01,25264.0,,28742,8,164393.0,120000.0,,,1
73,E09000023,lewisham,2000-12-01,22357.0,,22659,5,252106.0,76000.0,,,1
75,E09000025,newham,2000-12-01,19437.0,,21609,3,245463.0,79000.0,,,1


In [None]:
def get_ham():
  # add code to return rows which end with 'ham' for the year 2000
  df_new = df[(df['area'].str.endswith('ham') == True) & (df['date'].str.split('-').str.get(0) == '2000') ]

  return df_new
    

  



# run and test if correct number of rows are returned

actual = len(get_ham())
expected = 4 

if actual == expected:
  print("Test passed", actual)
else:
  print("Test failed, expected", expected, "got", actual)

Test passed 4


### Exercise 7 - new data set
---

Use the data set here:  https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true

Read the data from the sheet 'Skill Migration'  

Write a function called **create_new_df()** which will inspect the data, then create and return a new dataframe with the following changes:

1.  Remove the word 'Skills' from the 'skill_group_category' column   
  *hint: you can use the `str.rstrip()` function*
2.  Convert country_code to uppercase    
  *hint: try `upper()`*  
3.  Remove the skill_group_id and the wb_income columns
4.  Filter for regions containing 'Asia'  
  *hint: you might have to `return` it*

**Test output**:  
9969 rows × 10 columns
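One point worth knowing about `rstrip()` is that its argument is treated as a set of characters to strip from the right, not as a whole word. The sketch below uses two made-up category values to show the effect, and shows `str.replace()` as one possible alternative when an exact piece of text should be removed:

```python
import pandas as pd

# Hypothetical category values, for illustration only
categories = pd.Series(["Tech Skills", "Analysis Skills"])

# Strips any trailing 'S', 'k', 'i', 'l' or 's'; stops at the space
stripped = categories.str.rstrip("Skills")

# Adding the space to the set strips too far on 'Analysis'
stripped_with_space = categories.str.rstrip(" Skills")

# Removes the literal text ' Skills' wherever it occurs
replaced = categories.str.replace(" Skills", "", regex=False)

print(stripped.tolist())
print(stripped_with_space.tolist())
print(replaced.tolist())
```

With `rstrip('Skills')` a trailing space is left behind, which is why the test for this exercise expects `'Tech '` rather than `'Tech'`.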

In [None]:
import pandas as pd

url = 'https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true'
df = pd.read_excel(url,sheet_name = 'Skill Migration')
df.head()

Unnamed: 0,country_code,country_name,wb_income,wb_region,skill_group_id,skill_group_category,skill_group_name,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,net_per_10K_2018,net_per_10K_2019
0,af,Afghanistan,Low income,South Asia,2549,Tech Skills,Information Management,-791.59,-705.88,-550.04,-680.92,-1208.79
1,af,Afghanistan,Low income,South Asia,2608,Business Skills,Operational Efficiency,-1610.25,-933.55,-776.06,-532.22,-790.09
2,af,Afghanistan,Low income,South Asia,3806,Specialized Industry Skills,National Security,-1731.45,-769.68,-756.59,-600.44,-767.64
3,af,Afghanistan,Low income,South Asia,50321,Tech Skills,Software Testing,-957.5,-828.54,-964.73,-406.5,-739.51
4,af,Afghanistan,Low income,South Asia,1606,Specialized Industry Skills,Navy,-1510.71,-841.17,-842.32,-581.71,-718.64


In [None]:
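df['skill_group_category'] = df['skill_group_category'].str.rstrip('Skills')
df['country_code'] = df['country_code'].str.upper()
# this replaces the region names with True/False values rather than filtering the rows
df['wb_region'] = df['wb_region'].str.contains('Asia')
# this chained assignment sets every value to True, so no rows are filtered out
df['wb_region'] = df['wb_region'] = True
df_new = df.drop(columns = ['skill_group_id','wb_income'])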


In [None]:

df_new

Unnamed: 0,country_code,country_name,wb_region,skill_group_category,skill_group_name,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,net_per_10K_2018,net_per_10K_2019
0,AF,Afghanistan,True,Tech,Information Management,-791.59,-705.88,-550.04,-680.92,-1208.79
1,AF,Afghanistan,True,Business,Operational Efficiency,-1610.25,-933.55,-776.06,-532.22,-790.09
2,AF,Afghanistan,True,Specialized Industry,National Security,-1731.45,-769.68,-756.59,-600.44,-767.64
3,AF,Afghanistan,True,Tech,Software Testing,-957.50,-828.54,-964.73,-406.50,-739.51
4,AF,Afghanistan,True,Specialized Industry,Navy,-1510.71,-841.17,-842.32,-581.71,-718.64
...,...,...,...,...,...,...,...,...,...,...
17612,ZW,Zimbabwe,True,Specialized Industry,Teaching,71.18,30.68,-18.85,-68.89,-93.70
17613,ZW,Zimbabwe,True,Specialized Industry,Mining,8.97,-112.85,-35.87,-65.38,-93.46
17614,ZW,Zimbabwe,True,Specialized Industry,Personal Coaching,-53.45,-59.70,-88.01,-55.90,-82.23
17615,ZW,Zimbabwe,True,Specialized Industry,Public Health,15.25,-65.53,-57.22,-39.39,-32.14


In [None]:
# Reload the sheet first: the exploratory cell above overwrote wb_region with booleans,
# so calling .str on that column again raises an AttributeError
df = pd.read_excel(url,sheet_name = 'Skill Migration')

def create_new_df():
  # add code below to return a df with 'Skills' removed, country_code in uppercase, no skill_group_id or wb_income columns and only for regions containing Asia 
  # work on a copy so the original dataframe is left unchanged
  df_new = df.copy()
  df_new['skill_group_category'] = df_new['skill_group_category'].str.rstrip('Skills')
  df_new['country_code'] = df_new['country_code'].str.upper()
  # keep only the rows whose region contains 'Asia' (a missing region counts as no match)
  df_new = df_new[df_new['wb_region'].str.contains('Asia', na=False)]
  df_new = df_new.drop(columns = ['skill_group_id','wb_income'])

  return df_new



# run and test if returned dataframe is correct length, with the right number of columns  and first row skill_group_category is correct 
test_df = create_new_df()
actual_len = len(test_df)
actual_col = len(test_df.columns)
expected_len = 9969
expected_col = 10
actual_skill = test_df['skill_group_category'].iloc[0]
expected_skill = 'Tech '

if actual_len == expected_len and actual_col == expected_col and actual_skill == expected_skill:
  print("Test passed", actual_len, "x", actual_col, actual_skill)
else:
  print("Test failed, expected", expected_len, "x", expected_col, expected_skill, "got", actual_len, "x", actual_col, actual_skill)

### Exercise 8
---

Write a function called **clean_skills()** that will:
1. rename the **net_per_10K_year** columns to be just the year
2. in the **skill_group_category** column replace the 'z' in 'specialized' with 's' to Anglicise the spelling. 

The function should `return` the cleaned data.  

Hint:  You can use the `replace()` function to replace substrings and characters in both column headings and the actual data.  
* `.str.replace("old","new")`

**Test output**:  
17617 rows × 12 columns, with z replaced by s in Specialized  
Column names: country_code	country_name	wb_income	wb_region	skill_group_id	skill_group_category	skill_group_name	2015	2016	2017	2018	2019
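The hint can be sketched on a tiny made-up dataframe that borrows the column naming pattern from this data set. The same `.str.replace()` call works on `df.columns` (the headings) and on a column of values:

```python
import pandas as pd

# Hypothetical dataframe using the same naming pattern as the migration data
toy = pd.DataFrame({
    "skill_group_category": ["Specialized Industry Skills", "Tech Skills"],
    "net_per_10K_2015": [1.5, -2.0],
    "net_per_10K_2016": [0.5, -1.0],
})

# Replace inside the column headings; an empty string removes the prefix entirely
toy.columns = toy.columns.str.replace("net_per_10K_", "", regex=False)

# Replace inside the data values
toy["skill_group_category"] = toy["skill_group_category"].str.replace(
    "Specialized", "Specialised", regex=False
)

print(toy)
```

Replacing with `''` rather than `' '` avoids leaving a leading space in the new column names.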

In [None]:
import pandas as pd

url = 'https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true'
df = pd.read_excel(url,sheet_name = 'Skill Migration')
df.head()

Unnamed: 0,country_code,country_name,wb_income,wb_region,skill_group_id,skill_group_category,skill_group_name,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,net_per_10K_2018,net_per_10K_2019
0,af,Afghanistan,Low income,South Asia,2549,Tech Skills,Information Management,-791.59,-705.88,-550.04,-680.92,-1208.79
1,af,Afghanistan,Low income,South Asia,2608,Business Skills,Operational Efficiency,-1610.25,-933.55,-776.06,-532.22,-790.09
2,af,Afghanistan,Low income,South Asia,3806,Specialized Industry Skills,National Security,-1731.45,-769.68,-756.59,-600.44,-767.64
3,af,Afghanistan,Low income,South Asia,50321,Tech Skills,Software Testing,-957.5,-828.54,-964.73,-406.5,-739.51
4,af,Afghanistan,Low income,South Asia,1606,Specialized Industry Skills,Navy,-1510.71,-841.17,-842.32,-581.71,-718.64


In [None]:
df['skill_group_category'] = df['skill_group_category'].str.replace("Specialized","Specialised")


In [None]:
# replace the prefix with an empty string so no leading space is left in the column names
df.columns = df.columns.str.replace('net_per_10K_','')

df

Unnamed: 0,country_code,country_name,wb_income,wb_region,skill_group_id,skill_group_category,skill_group_name,2015,2016,2017,2018,2019
0,af,Afghanistan,Low income,South Asia,2549,Tech Skills,Information Management,-791.59,-705.88,-550.04,-680.92,-1208.79
1,af,Afghanistan,Low income,South Asia,2608,Business Skills,Operational Efficiency,-1610.25,-933.55,-776.06,-532.22,-790.09
2,af,Afghanistan,Low income,South Asia,3806,Specialised Industry Skills,National Security,-1731.45,-769.68,-756.59,-600.44,-767.64
3,af,Afghanistan,Low income,South Asia,50321,Tech Skills,Software Testing,-957.50,-828.54,-964.73,-406.50,-739.51
4,af,Afghanistan,Low income,South Asia,1606,Specialised Industry Skills,Navy,-1510.71,-841.17,-842.32,-581.71,-718.64
...,...,...,...,...,...,...,...,...,...,...,...,...
17612,zw,Zimbabwe,Low income,Sub-Saharan Africa,12666,Specialised Industry Skills,Teaching,71.18,30.68,-18.85,-68.89,-93.70
17613,zw,Zimbabwe,Low income,Sub-Saharan Africa,1235,Specialised Industry Skills,Mining,8.97,-112.85,-35.87,-65.38,-93.46
17614,zw,Zimbabwe,Low income,Sub-Saharan Africa,43756,Specialised Industry Skills,Personal Coaching,-53.45,-59.70,-88.01,-55.90,-82.23
17615,zw,Zimbabwe,Low income,Sub-Saharan Africa,1724,Specialised Industry Skills,Public Health,15.25,-65.53,-57.22,-39.39,-32.14


In [None]:
df['skill_group_category'] = df['skill_group_category'].str.replace('specialized','specialised')
df['skill_group_category']

IndexError: ignored

In [None]:
import pandas as pd

url = 'https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true'
migration = pd.read_excel(url,sheet_name = 'Skill Migration')



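def clean_skills(df):
  # work on a copy so the dataframe passed in is left unchanged
  df = df.copy()
  df['skill_group_category'] = df['skill_group_category'].str.replace("Specialized","Specialised")
 
  # replace the prefix with an empty string so no leading space is left in the column names
  df.columns = df.columns.str.replace('net_per_10K_','')
  return df

clean_skills(migration)


# run and test if columns have correct names and specialised is anglicised 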
test_df = clean_skills(migration)

if (test_df['skill_group_category'].str.contains('Specialised').any() == True) and (test_df.columns.str.contains('net_per_10K_').any() == False):
  print("Test passed")
else:
  print("Test failed")


Test passed


### Exercise 9
---

Read the 'Country Migration' sheet.

Write a function that will:  
*  convert the country codes to upper case  
*  drop the lat and long columns for both base and target  
*  rename the net_per_10K_year columns to year only  
*  filter for rows where base_country_wb_region contains 'Africa' and target_country_wb_region contains 'Asia'  

**Test output**:  
```
base_country_code	base_country_name	base_country_wb_income	base_country_wb_region	target_country_code	target_country_name	target_country_wb_income	target_country_wb_region	2015	2016	2017	2018	2019
0	AE	United Arab Emirates	High Income	Middle East & North Africa	AF	Afghanistan	Low Income	South Asia	0.19	0.16	0.11	-0.05	-0.02
4	AE	United Arab Emirates	High Income	Middle East & North Africa	AM	Armenia	Upper Middle Income	Europe & Central Asia	0.10	0.05	0.03	-0.01	0.02
5	AE	United Arab Emirates	High Income	Middle East & North Africa	AU	Australia	High Income	East Asia & Pacific	-1.06	-3.31	-4.01	-4.58	-4.09
6	AE	United Arab Emirates	High Income	Middle East & North Africa	AT	Austria	High Income	Europe & Central Asia	0.11	-0.08	-0.07	-0.05	-0.16
7	AE	United Arab Emirates	High Income	Middle East & North Africa	AZ	Azerbaijan	Upper Middle Income	Europe & Central Asia	0.24	0.25	0.10	0.05	0.04
...	...	...	...	...	...	...	...	...	...	...	...	...	...
4132	ZM	Zambia	Lower Middle Income	Sub-Saharan Africa	GB	United Kingdom	High Income	Europe & Central Asia	43.27	27.60	7.88	6.90	3.68
4135	ZW	Zimbabwe	Low Income	Sub-Saharan Africa	AU	Australia	High Income	East Asia & Pacific	-1.31	-2.33	-2.10	-2.08	-1.84
4138	ZW	Zimbabwe	Low Income	Sub-Saharan Africa	IS	Iceland	High Income	Europe & Central Asia	8.52	6.22	2.35	1.81	0.97
4142	ZW	Zimbabwe	Low Income	Sub-Saharan Africa	NO	Norway	High Income	Europe & Central Asia	2.88	6.46	2.10	0.33	-0.13
4145	ZW	Zimbabwe	Low Income	Sub-Saharan Africa	GB	United Kingdom	High Income	Europe & Central Asia	3.91	4.66	0.74	-0.66	-1.97
478 rows × 13 columns
```
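The final filtering step can be sketched on four made-up rows that reuse the two region column names from this sheet. Each `str.contains()` call gives a boolean Series, and `&` keeps the rows where both are True; `na=False` is an optional extra that treats missing values as no match.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "base_country_wb_region": [
        "Sub-Saharan Africa",
        "Sub-Saharan Africa",
        "Europe & Central Asia",
        "Middle East & North Africa",
    ],
    "target_country_wb_region": [
        "South Asia",
        "North America",
        "South Asia",
        "Europe & Central Asia",
    ],
})

base_africa = toy["base_country_wb_region"].str.contains("Africa", na=False)
target_asia = toy["target_country_wb_region"].str.contains("Asia", na=False)

# Rows must satisfy both conditions
result = toy[base_africa & target_asia]
print(result)
```

Note that the filtered rows keep their original index labels, which is why the expected output above has gaps in its row numbers.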



In [None]:
import pandas as pd

url = 'https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true'
df = pd.read_excel(url,sheet_name = 'Country Migration')
df.head()


Unnamed: 0,base_country_code,base_country_name,base_lat,base_long,base_country_wb_income,base_country_wb_region,target_country_code,target_country_name,target_lat,target_long,target_country_wb_income,target_country_wb_region,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,net_per_10K_2018,net_per_10K_2019
0,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,af,Afghanistan,33.93911,67.709953,Low Income,South Asia,0.19,0.16,0.11,-0.05,-0.02
1,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,dz,Algeria,28.033886,1.659626,Upper Middle Income,Middle East & North Africa,0.19,0.25,0.57,0.55,0.78
2,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,ao,Angola,-11.202692,17.873887,Lower Middle Income,Sub-Saharan Africa,-0.01,0.04,0.11,-0.02,-0.06
3,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,ar,Argentina,-38.416097,-63.616672,High Income,Latin America & Caribbean,0.16,0.18,0.04,0.01,0.23
4,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,am,Armenia,40.069099,45.038189,Upper Middle Income,Europe & Central Asia,0.1,0.05,0.03,-0.01,0.02


In [None]:
df['base_country_code'] = df['base_country_code'].str.upper()
df['target_country_code'] = df['target_country_code'].str.upper()
df = df.drop(columns = ['base_lat','base_long','target_lat','target_long'])
# replace the prefix with an empty string so the column names are just the year
df.columns = df.columns.str.replace('net_per_10K_','')


Unnamed: 0,base_country_code,base_country_name,base_country_wb_income,base_country_wb_region,target_country_code,target_country_name,target_country_wb_income,target_country_wb_region,2015,2016,2017,2018,2019
0,AE,United Arab Emirates,High Income,Middle East & North Africa,AF,Afghanistan,Low Income,South Asia,0.19,0.16,0.11,-0.05,-0.02
1,AE,United Arab Emirates,High Income,Middle East & North Africa,DZ,Algeria,Upper Middle Income,Middle East & North Africa,0.19,0.25,0.57,0.55,0.78
2,AE,United Arab Emirates,High Income,Middle East & North Africa,AO,Angola,Lower Middle Income,Sub-Saharan Africa,-0.01,0.04,0.11,-0.02,-0.06
3,AE,United Arab Emirates,High Income,Middle East & North Africa,AR,Argentina,High Income,Latin America & Caribbean,0.16,0.18,0.04,0.01,0.23
4,AE,United Arab Emirates,High Income,Middle East & North Africa,AM,Armenia,Upper Middle Income,Europe & Central Asia,0.10,0.05,0.03,-0.01,0.02
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4143,ZW,Zimbabwe,Low Income,Sub-Saharan Africa,ZA,South Africa,Upper Middle Income,Sub-Saharan Africa,-2.98,-11.79,-9.10,-12.08,-20.76
4144,ZW,Zimbabwe,Low Income,Sub-Saharan Africa,AE,United Arab Emirates,High Income,Middle East & North Africa,-2.50,-2.49,-2.21,-1.68,-3.19
4145,ZW,Zimbabwe,Low Income,Sub-Saharan Africa,GB,United Kingdom,High Income,Europe & Central Asia,3.91,4.66,0.74,-0.66,-1.97
4146,ZW,Zimbabwe,Low Income,Sub-Saharan Africa,US,United States,High Income,North America,38.60,37.76,10.09,6.06,5.25


In [None]:
df_new = df[(df['base_country_wb_region'].str.contains('Africa') == True) & (df['target_country_wb_region'].str.contains('Asia') == True)]
df_new

Unnamed: 0,base_country_code,base_country_name,base_country_wb_income,base_country_wb_region,target_country_code,target_country_name,target_country_wb_income,target_country_wb_region,2015,2016,2017,2018,2019
0,AE,United Arab Emirates,High Income,Middle East & North Africa,AF,Afghanistan,Low Income,South Asia,0.19,0.16,0.11,-0.05,-0.02
4,AE,United Arab Emirates,High Income,Middle East & North Africa,AM,Armenia,Upper Middle Income,Europe & Central Asia,0.10,0.05,0.03,-0.01,0.02
5,AE,United Arab Emirates,High Income,Middle East & North Africa,AU,Australia,High Income,East Asia & Pacific,-1.06,-3.31,-4.01,-4.58,-4.09
6,AE,United Arab Emirates,High Income,Middle East & North Africa,AT,Austria,High Income,Europe & Central Asia,0.11,-0.08,-0.07,-0.05,-0.16
7,AE,United Arab Emirates,High Income,Middle East & North Africa,AZ,Azerbaijan,Upper Middle Income,Europe & Central Asia,0.24,0.25,0.10,0.05,0.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4132,ZM,Zambia,Lower Middle Income,Sub-Saharan Africa,GB,United Kingdom,High Income,Europe & Central Asia,43.27,27.60,7.88,6.90,3.68
4135,ZW,Zimbabwe,Low Income,Sub-Saharan Africa,AU,Australia,High Income,East Asia & Pacific,-1.31,-2.33,-2.10,-2.08,-1.84
4138,ZW,Zimbabwe,Low Income,Sub-Saharan Africa,IS,Iceland,High Income,Europe & Central Asia,8.52,6.22,2.35,1.81,0.97
4142,ZW,Zimbabwe,Low Income,Sub-Saharan Africa,NO,Norway,High Income,Europe & Central Asia,2.88,6.46,2.10,0.33,-0.13


In [None]:
import pandas as pd

url = 'https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true'
migration = pd.read_excel(url,sheet_name = 'Country Migration')
migration.head()

def clean_country_mig(df):
  # add code below to clean the data 
  # work on a copy so the dataframe passed in is left unchanged
  df = df.copy()
  df['base_country_code'] = df['base_country_code'].str.upper()
  df['target_country_code'] = df['target_country_code'].str.upper()
  df = df.drop(columns = ['base_lat','base_long','target_lat','target_long'])
  # replace the prefix with an empty string so the column names are just the year
  df.columns = df.columns.str.replace('net_per_10K_','')

  df_new = df[(df['base_country_wb_region'].str.contains('Africa') == True) & (df['target_country_wb_region'].str.contains('Asia') == True)]
  print(df_new)
  return df_new

clean_country_mig(migration)





# run test if there is the correct number of columns, country codes are in uppercase and year columns have been reformatted 

test_df = clean_country_mig(migration)
actual_col_len = len(test_df.columns)
expected = 13

if actual_col_len == expected and (test_df['base_country_code'].str.islower().any() == False) and (test_df.columns.str.contains('net_per_10K_').any() == False):
  print("Test passed")
else:
  print("Test failed")

     base_country_code     base_country_name base_country_wb_income  \
0                   AE  United Arab Emirates            High Income   
4                   AE  United Arab Emirates            High Income   
5                   AE  United Arab Emirates            High Income   
6                   AE  United Arab Emirates            High Income   
7                   AE  United Arab Emirates            High Income   
...                ...                   ...                    ...   
4132                ZM                Zambia    Lower Middle Income   
4135                ZW              Zimbabwe             Low Income   
4138                ZW              Zimbabwe             Low Income   
4142                ZW              Zimbabwe             Low Income   
4145                ZW              Zimbabwe             Low Income   

          base_country_wb_region target_country_code target_country_name  \
0     Middle East & North Africa                  AF         Afghanista

### Exercise 10
---

Read the data from file 'https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv'.

Write a function that will return a new dataframe with just the married women listed, surname only.

**Test output**:  
```
	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	2	1	1	Cumings	female	38.0	1	0	PC 17599	71.2833	C85	C
3	4	1	1	Futrelle	female	35.0	1	0	113803	53.1000	C123	S
8	9	1	3	Johnson	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser	female	14.0	1	0	237736	30.0708	NaN	C
15	16	1	2	Hewlett	female	55.0	0	0	248706	16.0000	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
871	872	1	1	Beckwith	female	47.0	1	1	11751	52.5542	D35	S
874	875	1	2	Abelson	female	28.0	1	0	P/PP 3381	24.0000	NaN	C
879	880	1	1	Potter	female	56.0	0	1	11767	83.1583	C50	C
880	881	1	2	Shelley	female	25.0	0	1	230433	26.0000	NaN	S
885	886	0	3	Rice	female	39.0	0	5	382652	29.1250	NaN	Q
129 rows × 12 columns
```
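Two details of this task can be sketched on a few names. The first two below follow the format of the Titanic data and the last two are made up for illustration. By default `str.contains()` treats its pattern as a regular expression, so the `.` in `'Mrs.'` matches any character; and splitting on the comma keeps surnames that contain a space in one piece.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "De Example, Mrs. Jane",            # hypothetical name
    'Smith, Miss. Anna ("Mrs Jones")',  # hypothetical name
])

# Default: regular expression, so 'Mrs.' also matches 'Mrs ' followed by a space
regex_mask = names.str.contains("Mrs.")

# Literal match: only the exact text 'Mrs.' counts
literal_mask = names.str.contains("Mrs.", regex=False)

# Everything before the comma is the surname
surnames = names[literal_mask].str.split(",").str.get(0)

# Splitting on white space instead cuts multi-word surnames short
first_word = names[literal_mask].str.split().str.get(0)

print(surnames.tolist())
print(first_word.tolist())
```

Which of the two matching modes is appropriate depends on the data; the row count you get can differ between them.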





In [None]:
import pandas as pd

url ='https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv'
df = pd.read_csv(url)
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [None]:
#married = df['Name'].str.contains('Mrs.')
# take a copy of the filtered rows so the assignment below does not trigger a SettingWithCopyWarning
df = df[df['Name'].str.contains('Mrs.') == True].copy()
# the surname is everything before the first comma
df['Name'] = df['Name'].str.split(',').str.get(0)
df


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,Cumings,female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,Futrelle,female,35.0,1,0,113803,53.1000,C123,S
8,9,1,3,Johnson,female,27.0,0,2,347742,11.1333,,S
9,10,1,2,Nasser,female,14.0,1,0,237736,30.0708,,C
15,16,1,2,Hewlett,female,55.0,0,0,248706,16.0000,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,Beckwith,female,47.0,1,1,11751,52.5542,D35,S
874,875,1,2,Abelson,female,28.0,1,0,P/PP 3381,24.0000,,C
879,880,1,1,Potter,female,56.0,0,1,11767,83.1583,C50,C
880,881,1,2,Shelley,female,25.0,0,1,230433,26.0000,,S


In [None]:
import pandas as pd

url ='https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv'
titanic = pd.read_csv(url)


def get_married(df):
  # add code to return only the last names of married women
  # take a copy of the filtered rows so the assignment below does not trigger a SettingWithCopyWarning
  df = df[df['Name'].str.contains('Mrs.') == True].copy()
  # the surname is everything before the first comma
  df['Name'] = df['Name'].str.split(',').str.get(0)
  return df



# run and test if returned dataframe is correct length and has correct first row 
test_df = get_married(titanic)
actual_len = len(test_df)
expected_len = 129
actual_name = test_df['Name'].iloc[0]
expected_name = 'Cumings'

if actual_len == expected_len and actual_name == expected_name:
  print("Test passed, ", actual_len, actual_name)
else:
  print("Test failed expected ", expected_len, expected_name, "got", actual_len, actual_name)

Test passed,  129 Cumings


---
### Optional extra practice

There are some similar and some more challenging exercises [here](https://www.w3resource.com/python-exercises/date-time-exercise/) if you would like to practice more. The site has its own editor.

# Reflection
----

## What skills have you demonstrated in completing this notebook?

Your answer:

ways to clean the data: 
str.contains()
str.upper()
str.split().str.get()
str.replace()

df[(df['area'].str.endswith('ham') == True)]

## What caused you the most difficulty?

Your answer:
str.contains() / str.startswith() / str.endswith()

Understanding that these check each value for specific text and return True or False for that value