## Import the Pandas library and create a DataFrame

Before doing anything else, you'll need to import Pandas and get some data to work with.

In [1]:
import pandas as pd

In [2]:
data = {
    'Name': ['Alice', 'Bob', 'Claire', 'David', 'Emma'],
    'Age': [25, 30, 22, 28, 26],
    'Department': ['Marketing', 'Finance', 'Sales', 'HR', 'Marketing'],
    'City': ['New York', 'London', 'Paris', 'San Francisco', 'Sydney'],
    'Email': ['alice@example.com', 'bob@example.com', 'claire@example.com', 'david@example.com', 'emma@example.com'],
    'Job_Title' : ['Data Scientist', 'Financial Analyst', 'Sales Executive', 'HR Manager', 'Marketing Specialist'],
    'Full_Name' : ['Alice Johnson', 'Bob Smith', 'Claire Williams', 'David Lee', 'Emma Brown'],
    'Phone' : ['123-456-7890', '987-654-3210', '555-123-4567', '111-222-3333', '444-555-6666'],
    'Address' : ['       123 Main Street', '        456 Park Avenue', '         789 Elm Road', '        321 Oak Street','        555 Maple Lane']
}

df = pd.DataFrame(data)

print('\n', df)
df.info()


      Name  Age Department           City               Email  \
0   Alice   25  Marketing       New York   alice@example.com   
1     Bob   30    Finance         London     bob@example.com   
2  Claire   22      Sales          Paris  claire@example.com   
3   David   28         HR  San Francisco   david@example.com   
4    Emma   26  Marketing         Sydney    emma@example.com   

              Job_Title        Full_Name         Phone  \
0        Data Scientist    Alice Johnson  123-456-7890   
1     Financial Analyst        Bob Smith  987-654-3210   
2       Sales Executive  Claire Williams  555-123-4567   
3            HR Manager        David Lee  111-222-3333   
4  Marketing Specialist       Emma Brown  444-555-6666   

                   Address  
0          123 Main Street  
1          456 Park Avenue  
2             789 Elm Road  
3           321 Oak Street  
4           555 Maple Lane  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 9 

## The `.str` accessor

In Pandas, the `.str` accessor allows us to perform various string operations on DataFrame columns containing string values. This provides a convenient and efficient way to work with textual data within DataFrames.

> **Note:** The `.str` accessor works on Series. To use any of the tools below, be sure to specify which column of a DataFrame you wish to work with.

### String indexing and slicing

We can access individual characters of each string in a DataFrame column using string indexing. Let's select the first letter of each name:

In [3]:
df['Name'].str[0]

0    A
1    B
2    C
3    D
4    E
Name: Name, dtype: object

Using a slice in the string indexer, we'll now select the first four characters of each city.

In [4]:
df['City'].str[:4]

0    New 
1    Lond
2    Pari
3    San 
4    Sydn
Name: City, dtype: object
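
The indexer accepts the same forms as ordinary Python string indexing, so negative positions and open-ended slices should work too. The sketch below uses a small stand-alone Series (`cities` is an illustrative name, not part of `df`) and also shows `.str.slice()`, which is the method form of the same operation.

```python
import pandas as pd

# Illustrative data, independent of the DataFrame above
cities = pd.Series(['New York', 'London'])

last_char = cities.str[-1]           # last character of each string
last_three = cities.str[-3:]         # last three characters
first_four = cities.str.slice(0, 4)  # same result as cities.str[:4]
```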

### Converting cases

`.str.lower()`      
With this method you can change all letters to lower case.

In [5]:
df['Full_Name'].str.lower()

0      alice johnson
1          bob smith
2    claire williams
3          david lee
4         emma brown
Name: Full_Name, dtype: object

Similarly, Pandas offers:  

`.str.upper()`  
Converts all characters to uppercase.  

`.str.title()`  
Converts the first character of each word to uppercase and the remaining characters to lowercase.  

`.str.capitalize()`  
Converts the first character of the whole string to uppercase and the remaining characters to lowercase.  

`.str.swapcase()`  
Converts uppercase to lowercase and lowercase to uppercase.  

`.str.casefold()`  
Removes all case distinctions in the string. It is a more aggressive form of lowercasing intended for caseless comparison: it also converts characters that `.str.lower()` leaves unchanged, e.g. "ß" becomes "ss".
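
As a quick comparison, the following sketch applies these methods to a small made-up Series (`s` is illustrative and not part of `df`). The second value is included only to show where `.str.casefold()` and `.str.lower()` differ.

```python
import pandas as pd

# Illustrative data, independent of the DataFrame above
s = pd.Series(['alice JOHNSON', 'Straße'])

upper = s.str.upper()
title = s.str.title()
capitalized = s.str.capitalize()
swapped = s.str.swapcase()
lowered = s.str.lower()      # leaves 'ß' as it is
folded = s.str.casefold()    # turns 'ß' into 'ss'
```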


### Conditions

`.str.startswith()` and `.str.endswith()`  
Often used for filtering DataFrames, these methods check whether each string begins or ends with the given string and return a boolean Series.

In [6]:
df['Email'].str.startswith('david')

0    False
1    False
2    False
3     True
4    False
Name: Email, dtype: bool
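
Because the result is a boolean Series, it can be passed straight back to the DataFrame as a row filter. A minimal sketch, using a small made-up DataFrame (`people` and its values are assumptions for illustration):

```python
import pandas as pd

# Illustrative data, independent of the DataFrame above
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'David'],
    'Email': ['alice@example.com', 'bob@example.org', 'david@example.com'],
})

# Boolean mask: True where the email address ends with '.com'
mask = people['Email'].str.endswith('.com')

# Keep only the rows where the mask is True
dot_com = people[mask]
```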

`.str.contains()`  
Another useful method for filtering, `.str.contains()` checks if any part of each string matches the given string.

In [7]:
df['Email'].str.contains('@')

0    True
1    True
2    True
3    True
4    True
Name: Email, dtype: bool
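
A few optional parameters of `.str.contains()` tend to matter in practice: `case` for case-insensitive matching, `regex` to switch pattern matching off, and `na` to decide what missing values become. The sketch below uses a made-up Series (`s`) that deliberately includes a missing value.

```python
import pandas as pd

# Illustrative data, including one missing value
s = pd.Series(['Alice@Example.com', 'bob.example.com', None])

# regex=False treats '@' as plain text; na=False maps missing values to False
has_at = s.str.contains('@', regex=False, na=False)

# case=False ignores upper/lower case differences
has_example = s.str.contains('example', case=False, na=False)
```

Setting `na=False` keeps the result purely boolean, which is convenient when it is used as a filter.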

### Length and counting

`.str.len()`  
This method will count and return the number of characters in each string.

In [8]:
df['Address'].str.len()

0    22
1    23
2    21
3    22
4    22
Name: Address, dtype: int64

`.str.count()`  
Returns the number of times a pattern occurs in each string of the column. The pattern is interpreted as a regular expression.

In [9]:
df['Email'].str.count('e')

0    3
1    2
2    3
3    2
4    3
Name: Email, dtype: int64
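
Since the argument is treated as a regular expression, a character class can count several letters at once, and special characters such as `.` need escaping to be counted literally. A small sketch with made-up data (`emails` is illustrative):

```python
import pandas as pd

# Illustrative data, independent of the DataFrame above
emails = pd.Series(['alice@example.com', 'bob@example.com'])

# Count every lowercase vowel in each string
vowels = emails.str.count(r'[aeiou]')

# Escape the dot so it counts literal '.' characters only
dots = emails.str.count(r'\.')
```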

### Manipulating strings

`.str.replace()`  
Used to locate one sub-string and, if it exists, replace it with another.

In [10]:
df['City'].str.replace('New', 'Old')

0         Old York
1           London
2            Paris
3    San Francisco
4           Sydney
Name: City, dtype: object
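
Whether `.str.replace()` treats its first argument as a regular expression by default has differed between pandas versions, so passing `regex=` explicitly is the safer habit. The sketch below shows both modes on a made-up Series (`phones` is illustrative):

```python
import pandas as pd

# Illustrative data, independent of the DataFrame above
phones = pd.Series(['123-456-7890', '987-654-3210'])

# Literal replacement: remove every hyphen
digits_only = phones.str.replace('-', '', regex=False)

# Regex replacement: mask the final four digits
masked = phones.str.replace(r'\d{4}$', 'XXXX', regex=True)
```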

`.str.strip()`  
It is not uncommon for data to end up carrying certain artefacts of the ETL process, often as leading or trailing characters. Most commonly, these are whitespace characters; `.str.strip()` removes whitespace before and after a string by default, and can remove other characters when they are passed as an argument.

In [11]:
df['Address']

0            123 Main Street
1            456 Park Avenue
2               789 Elm Road
3             321 Oak Street
4             555 Maple Lane
Name: Address, dtype: object

In [12]:
df['Address'].str.strip().str.len()

0    15
1    15
2    12
3    14
4    14
Name: Address, dtype: int64
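
To illustrate the non-default behaviour, the sketch below strips a specific character and uses `.str.lstrip()`, the one-sided variant (there is a matching `.str.rstrip()`). The Series `raw` and its values are made up for the example.

```python
import pandas as pd

# Illustrative data with different kinds of padding
raw = pd.Series(['  123 Main Street', '--456 Park Avenue--', '789 Elm Road\n'])

# Default: remove leading and trailing whitespace (including newlines)
clean_ws = raw.str.strip()

# Remove leading and trailing hyphens only; whitespace is left alone
clean_dash = raw.str.strip('-')

# Remove whitespace from the left-hand side only
left_only = raw.str.lstrip()
```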

`.str.split()`  
Used to break a string down into its constituent parts, `.str.split()` searches each string for a given separator and starts a new list item each time that separator is encountered. By default, splits are made on whitespace.

In [13]:
df['Full_Name'].str.split()

0      [Alice, Johnson]
1          [Bob, Smith]
2    [Claire, Williams]
3          [David, Lee]
4         [Emma, Brown]
Name: Full_Name, dtype: object

The resulting lists can be accessed with a further `.str` followed by an indexer or `.get()`.

In [14]:
df['Full_Name'].str.split(' ').str[1]

0     Johnson
1       Smith
2    Williams
3         Lee
4       Brown
Name: Full_Name, dtype: object

In [15]:
df['Full_Name'].str.split(' ').str.get(1)

0     Johnson
1       Smith
2    Williams
3         Lee
4       Brown
Name: Full_Name, dtype: object
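
When the parts are wanted as separate columns rather than lists, `expand=True` returns a DataFrame, and `n` limits the number of splits. A minimal sketch with made-up names (`names`, `First`, and `Rest` are illustrative labels):

```python
import pandas as pd

# Illustrative data; the last name has three words
names = pd.Series(['Alice Johnson', 'Bob Smith', 'Mary Ann Lee'])

# Split on the first space only and spread the parts across columns
parts = names.str.split(' ', n=1, expand=True)
parts.columns = ['First', 'Rest']
```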

## Regular expressions

Regular expressions, commonly known as regex, are powerful tools for pattern matching and text manipulation.  
The regex syntax consists of metacharacters, quantifiers, character classes, and more, which define the rules for matching patterns in strings.

Common Metacharacters and Their Meanings:

**. (Period)**: Matches any character except a newline.  
For example, the pattern a.b will match 'aab', 'acb', 'a9b', but not 'a\nb'.

**\* (Asterisk)**: Matches zero or more occurrences of the preceding character.  
For example, the pattern ab*c will match 'ac', 'abc', 'abbc', 'abbbc', and so on.

**\+ (Plus)**: Matches one or more occurrences of the preceding character.  
For example, the pattern ab+c will match 'abc', 'abbc', 'abbbc', but not 'ac'.

**? (Question Mark)**: Matches zero or one occurrence of the preceding character.  
For example, the pattern colou?r will match both 'color' and 'colour'.

**^ (Caret)**: Matches the start of a string.  
For example, the pattern ^abc will match 'abc' only if it appears at the beginning of a string.

**\$ (Dollar)**: Matches the end of a string.  
For example, the pattern abc$ will match 'abc' only if it appears at the end of a string.

**[ ] (Square Brackets)**: Matches any single character within the specified set.  
For example, the pattern [aeiou] will match any vowel.

**[^] (Caret Inside Square Brackets)**: Matches any single character not within the specified set.  
For example, the pattern [^aeiou] will match any character that is not one of those vowels, including consonants, digits, and spaces.

Try [this site](https://regex101.com/) for diving deeper into regular expressions.
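
The examples above can also be checked locally with Python's built-in `re` module. The helper `matches` below is a hypothetical convenience function written for this sketch; it simply reports whether a pattern is found anywhere in a string.

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

results = [
    matches(r'a.b', 'a9b'),        # True: . matches any character
    matches(r'a.b', 'a\nb'),       # False: . does not match a newline
    matches(r'ab*c', 'ac'),        # True: zero 'b' characters is allowed
    matches(r'ab+c', 'ac'),        # False: at least one 'b' is required
    matches(r'colou?r', 'color'),  # True: the 'u' is optional
    matches(r'^abc', 'xabc'),      # False: 'abc' is not at the start
    matches(r'abc$', 'xabc'),      # True: 'abc' is at the end
    matches(r'[^aeiou]', 'aei'),   # False: every character is a vowel
]
```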

### Regex in Pandas
In pandas, certain methods allow for regex pattern matching — some by default and others when explicitly set to do so.

In [16]:
df['Phone'].str.contains(r'\d+-\d+-\d+')

0    True
1    True
2    True
3    True
4    True
Name: Phone, dtype: bool

`\d+-\d+-\d+`     
This pattern matches any string that contains three runs of one or more digits separated by two hyphens.
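
Two related methods may be worth knowing here: `.str.fullmatch()` requires the whole string to fit the pattern, and `.str.extract()` pulls out the part captured by a group. The sketch below uses made-up values (`phones` is illustrative) and a stricter pattern that assumes the 3-3-4 digit layout of the phone numbers above.

```python
import pandas as pd

# Illustrative data: a clean number, a number inside text, and a non-number
phones = pd.Series(['123-456-7890', 'call 555-123-4567 now', 'n/a'])

# True wherever the pattern occurs anywhere in the string
found = phones.str.contains(r'\d+-\d+-\d+')

# True only when the entire string fits the pattern
exact = phones.str.fullmatch(r'\d{3}-\d{3}-\d{4}')

# Capture the first three digits; rows without a match become missing values
area = phones.str.extract(r'(\d{3})-\d{3}-\d{4}', expand=False)
```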

## Challenges

In [17]:
data = {
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'San Francisco', 'London', 'Paris', 'Berlin', 'Rome', 'Tokyo'],
    'Country': ['USA', 'USA', 'USA', 'USA', 'USA', 'UK', 'France', 'Germany', 'Italy', 'Japan'],
    'Population (Millions)': [8.4, 3.9, 2.7, 2.3, 0.9, 8.9, 2.1, 3.7, 2.8, 13.9],
    'Area (km2)': [468.9, 502.8, 227.6, 1, 121.4, 1572, 105.4, 891.8, 1285, 2187],
    'Language': ['English', 'English', 'English', 'English', 'English', 'English', 'French', 'German', 'Italian', 'Japanese'],
    'Currency': ['USD', 'USD', 'USD', 'USD', 'USD', 'GBP', 'EUR', 'EUR', 'EUR', 'JPY'],
    'Continent': ['North America', 'North America', 'North America', 'North America', 'North America', 'Europe', 'Europe', 'Europe', 'Europe', 'Asia'],
    'Is_Capital': [False, False, False, False, False, True, True, True, True, True]
}

cities_df = pd.DataFrame(data)

# Adding more rows
extra_data = {
    'City': ['Sydney', 'Seoul', 'Beijing', 'Moscow', 'Cairo', 'Mumbai'],
    'Country': ['Australia', 'South Korea', 'China', 'Russia', 'Egypt', 'India'],
    'Population (Millions)': [5.3, 9.7, 21.5, 12.5, 9.5, 20.7],
    'Area (km2)': [1687, 605, 16411, 2561, 3034, 603],
    'Language': ['English', 'Korean', 'Mandarin', 'Russian', 'Arabic', 'Hindi'],
    'Currency': ['AUD', 'KRW', 'CNY', 'RUB', 'EGP', 'INR'],
    'Continent': ['Australia', 'Asia', 'Asia', 'Europe', 'Africa', 'Asia'],
    'Is_Capital': [False, True, True, True, True, False]
}

extra_df = pd.DataFrame(extra_data)
cities_df = pd.concat([cities_df, extra_df], ignore_index=True)

cities_df

Unnamed: 0,City,Country,Population (Millions),Area (km2),Language,Currency,Continent,Is_Capital
0,New York,USA,8.4,468.9,English,USD,North America,False
1,Los Angeles,USA,3.9,502.8,English,USD,North America,False
2,Chicago,USA,2.7,227.6,English,USD,North America,False
3,Houston,USA,2.3,1.0,English,USD,North America,False
4,San Francisco,USA,0.9,121.4,English,USD,North America,False
5,London,UK,8.9,1572.0,English,GBP,Europe,True
6,Paris,France,2.1,105.4,French,EUR,Europe,True
7,Berlin,Germany,3.7,891.8,German,EUR,Europe,True
8,Rome,Italy,2.8,1285.0,Italian,EUR,Europe,True
9,Tokyo,Japan,13.9,2187.0,Japanese,JPY,Asia,True


### Challenge 1
Create a new column 'City_Length' that contains the length of each city name

In [18]:
# Your code here
cities_df['City_Length'] = cities_df['City'].str.len()
cities_df

Unnamed: 0,City,Country,Population (Millions),Area (km2),Language,Currency,Continent,Is_Capital,City_Length
0,New York,USA,8.4,468.9,English,USD,North America,False,8
1,Los Angeles,USA,3.9,502.8,English,USD,North America,False,11
2,Chicago,USA,2.7,227.6,English,USD,North America,False,7
3,Houston,USA,2.3,1.0,English,USD,North America,False,7
4,San Francisco,USA,0.9,121.4,English,USD,North America,False,13
5,London,UK,8.9,1572.0,English,GBP,Europe,True,6
6,Paris,France,2.1,105.4,French,EUR,Europe,True,5
7,Berlin,Germany,3.7,891.8,German,EUR,Europe,True,6
8,Rome,Italy,2.8,1285.0,Italian,EUR,Europe,True,4
9,Tokyo,Japan,13.9,2187.0,Japanese,JPY,Asia,True,5


### Challenge 2
Convert the 'City' names to uppercase and store them in a new column 'City_Upper'

In [19]:
# Your code here
cities_df['City_Upper'] = cities_df['City'].str.upper()
cities_df

Unnamed: 0,City,Country,Population (Millions),Area (km2),Language,Currency,Continent,Is_Capital,City_Length,City_Upper
0,New York,USA,8.4,468.9,English,USD,North America,False,8,NEW YORK
1,Los Angeles,USA,3.9,502.8,English,USD,North America,False,11,LOS ANGELES
2,Chicago,USA,2.7,227.6,English,USD,North America,False,7,CHICAGO
3,Houston,USA,2.3,1.0,English,USD,North America,False,7,HOUSTON
4,San Francisco,USA,0.9,121.4,English,USD,North America,False,13,SAN FRANCISCO
5,London,UK,8.9,1572.0,English,GBP,Europe,True,6,LONDON
6,Paris,France,2.1,105.4,French,EUR,Europe,True,5,PARIS
7,Berlin,Germany,3.7,891.8,German,EUR,Europe,True,6,BERLIN
8,Rome,Italy,2.8,1285.0,Italian,EUR,Europe,True,4,ROME
9,Tokyo,Japan,13.9,2187.0,Japanese,JPY,Asia,True,5,TOKYO


### Challenge 3
Check if the 'City' names end with the letter 'o'. Create a new column 'Ends_With_O' with the boolean results

In [20]:
# Your code here
cities_df['Ends_With_O'] = cities_df['City'].str.endswith('o')
cities_df

Unnamed: 0,City,Country,Population (Millions),Area (km2),Language,Currency,Continent,Is_Capital,City_Length,City_Upper,Ends_With_O
0,New York,USA,8.4,468.9,English,USD,North America,False,8,NEW YORK,False
1,Los Angeles,USA,3.9,502.8,English,USD,North America,False,11,LOS ANGELES,False
2,Chicago,USA,2.7,227.6,English,USD,North America,False,7,CHICAGO,True
3,Houston,USA,2.3,1.0,English,USD,North America,False,7,HOUSTON,False
4,San Francisco,USA,0.9,121.4,English,USD,North America,False,13,SAN FRANCISCO,True
5,London,UK,8.9,1572.0,English,GBP,Europe,True,6,LONDON,False
6,Paris,France,2.1,105.4,French,EUR,Europe,True,5,PARIS,False
7,Berlin,Germany,3.7,891.8,German,EUR,Europe,True,6,BERLIN,False
8,Rome,Italy,2.8,1285.0,Italian,EUR,Europe,True,4,ROME,False
9,Tokyo,Japan,13.9,2187.0,Japanese,JPY,Asia,True,5,TOKYO,True


### Challenge 4
Replace the word 'York' in 'City' names with 'Ville' and update the 'City' column accordingly

In [23]:
# Your code here
cities_df['City'] = cities_df['City'].str.replace('York', 'Ville')
cities_df

Unnamed: 0,City,Country,Population (Millions),Area (km2),Language,Currency,Continent,Is_Capital,City_Length,City_Upper,Ends_With_O
0,New Ville,USA,8.4,468.9,English,USD,North America,False,8,NEW YORK,False
1,Los Angeles,USA,3.9,502.8,English,USD,North America,False,11,LOS ANGELES,False
2,Chicago,USA,2.7,227.6,English,USD,North America,False,7,CHICAGO,True
3,Houston,USA,2.3,1.0,English,USD,North America,False,7,HOUSTON,False
4,San Francisco,USA,0.9,121.4,English,USD,North America,False,13,SAN FRANCISCO,True
5,London,UK,8.9,1572.0,English,GBP,Europe,True,6,LONDON,False
6,Paris,France,2.1,105.4,French,EUR,Europe,True,5,PARIS,False
7,Berlin,Germany,3.7,891.8,German,EUR,Europe,True,6,BERLIN,False
8,Rome,Italy,2.8,1285.0,Italian,EUR,Europe,True,4,ROME,False
9,Tokyo,Japan,13.9,2187.0,Japanese,JPY,Asia,True,5,TOKYO,True


### Challenge 5
Create a new column 'Country_Code' by extracting the first three characters from the 'Country' names

In [44]:
# Your code here
cities_df['Country_Code'] = cities_df['Country'].str[0:3]
cities_df

Unnamed: 0,City,Country,Population (Millions),Area (km2),Language,Currency,Continent,Is_Capital,City_Length,City_Upper,Ends_With_O,Country_Code
0,New Ville,USA,8.4,468.9,English,USD,North America,False,8,NEW YORK,False,USA
1,Los Angeles,USA,3.9,502.8,English,USD,North America,False,11,LOS ANGELES,False,USA
2,Chicago,USA,2.7,227.6,English,USD,North America,False,7,CHICAGO,True,USA
3,Houston,USA,2.3,1.0,English,USD,North America,False,7,HOUSTON,False,USA
4,San Francisco,USA,0.9,121.4,English,USD,North America,False,13,SAN FRANCISCO,True,USA
5,London,UK,8.9,1572.0,English,GBP,Europe,True,6,LONDON,False,UK
6,Paris,France,2.1,105.4,French,EUR,Europe,True,5,PARIS,False,Fra
7,Berlin,Germany,3.7,891.8,German,EUR,Europe,True,6,BERLIN,False,Ger
8,Rome,Italy,2.8,1285.0,Italian,EUR,Europe,True,4,ROME,False,Ita
9,Tokyo,Japan,13.9,2187.0,Japanese,JPY,Asia,True,5,TOKYO,True,Jap


### Challenge 6
Count the occurrences of the letter 'a' in each 'City' name

In [46]:
# Your code here
# Lower-case first so that both 'a' and 'A' are counted, then name the result
a_count = cities_df['City'].str.lower().str.count('a').rename('A_Count')
pd.concat([cities_df['City'], a_count], axis=1)

Unnamed: 0,City,A_Count
0,New Ville,0
1,Los Angeles,1
2,Chicago,1
3,Houston,0
4,San Francisco,2
5,London,0
6,Paris,1
7,Berlin,0
8,Rome,0
9,Tokyo,0


### Challenge 7
Check if the 'City' names start with the letter 'C' and end with the letter 'o'

In [42]:
# Your code here
pd.concat([cities_df, cities_df['City'].str.contains(r'^C.*o$')], axis = 1)
#cities_df['City'].str.contains(r'^C.*o$')

Unnamed: 0,City,Country,Population (Millions),Area (km2),Language,Currency,Continent,Is_Capital,City_Length,City_Upper,Ends_With_O,Country_Code,City.1
0,New Ville,USA,8.4,468.9,English,USD,North America,False,8,NEW YORK,False,USA,False
1,Los Angeles,USA,3.9,502.8,English,USD,North America,False,11,LOS ANGELES,False,USA,False
2,Chicago,USA,2.7,227.6,English,USD,North America,False,7,CHICAGO,True,USA,True
3,Houston,USA,2.3,1.0,English,USD,North America,False,7,HOUSTON,False,USA,False
4,San Francisco,USA,0.9,121.4,English,USD,North America,False,13,SAN FRANCISCO,True,USA,False
5,London,UK,8.9,1572.0,English,GBP,Europe,True,6,LONDON,False,UK,False
6,Paris,France,2.1,105.4,French,EUR,Europe,True,5,PARIS,False,Fra,False
7,Berlin,Germany,3.7,891.8,German,EUR,Europe,True,6,BERLIN,False,Ger,False
8,Rome,Italy,2.8,1285.0,Italian,EUR,Europe,True,4,ROME,False,Ita,False
9,Tokyo,Japan,13.9,2187.0,Japanese,JPY,Asia,True,5,TOKYO,True,Jap,False
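
The same check can be written without a regular expression by combining two boolean Series with `&`. This is an alternative sketch on a small made-up Series (`cities` is illustrative), not a replacement for the regex solution above.

```python
import pandas as pd

# Illustrative data, independent of cities_df
cities = pd.Series(['Chicago', 'Cairo', 'San Francisco', 'Chennai'], name='City')

# Element-wise AND of the two conditions
starts_c_ends_o = cities.str.startswith('C') & cities.str.endswith('o')
```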


### Challenge 8
Check if the 'City' names contain exactly two words

In [43]:
# Your code here
pd.concat([cities_df, cities_df['City'].str.contains(r'^\S+ \S+$')], axis = 1)

Unnamed: 0,City,Country,Population (Millions),Area (km2),Language,Currency,Continent,Is_Capital,City_Length,City_Upper,Ends_With_O,Country_Code,City.1
0,New Ville,USA,8.4,468.9,English,USD,North America,False,8,NEW YORK,False,USA,True
1,Los Angeles,USA,3.9,502.8,English,USD,North America,False,11,LOS ANGELES,False,USA,True
2,Chicago,USA,2.7,227.6,English,USD,North America,False,7,CHICAGO,True,USA,False
3,Houston,USA,2.3,1.0,English,USD,North America,False,7,HOUSTON,False,USA,False
4,San Francisco,USA,0.9,121.4,English,USD,North America,False,13,SAN FRANCISCO,True,USA,True
5,London,UK,8.9,1572.0,English,GBP,Europe,True,6,LONDON,False,UK,False
6,Paris,France,2.1,105.4,French,EUR,Europe,True,5,PARIS,False,Fra,False
7,Berlin,Germany,3.7,891.8,German,EUR,Europe,True,6,BERLIN,False,Ger,False
8,Rome,Italy,2.8,1285.0,Italian,EUR,Europe,True,4,ROME,False,Ita,False
9,Tokyo,Japan,13.9,2187.0,Japanese,JPY,Asia,True,5,TOKYO,True,Jap,False
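
Another possible approach is to split each name on whitespace and compare the number of parts with 2. A minimal sketch on made-up data (`cities` is illustrative):

```python
import pandas as pd

# Illustrative data, independent of cities_df
cities = pd.Series(['New York', 'Chicago', 'Rio de Janeiro'])

# Split on whitespace, then count the resulting parts per row
two_words = cities.str.split().str.len() == 2
```

Unlike the regex above, this version also tolerates repeated spaces between the two words, because the default split treats any run of whitespace as one separator.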
