<a href="https://colab.research.google.com/github/edoardochiarotti/class_datascience/blob/main/2024/00_Python-Basics/00_Python-Basics_5_String-methods.ipynb" target="_blank" rel="noopener"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python-Basics: string methods

<img src='https://www.agent-x.com.au/wp-content/uploads/2011/06/Perfect-Programmer-dfe194b-e8d3b11-b960bd5.jpg' width="400">

Source: [Agent-X Comics - Perfect Programming](https://www.agent-x.com.au/comic/perfect-programming/)

## Content

We have previously discovered methods for lists and dictionaries. Do you remember what methods are? They are functions that are specific to a given type of object. In this notebook, we will explore string methods!

- [String methods](#string-methods)
   - [Indexing and slicing of strings](#string-index)
   - [Counting](#Counting)
   - [Case conversion](#case-conversion)
   - [Finding substring](#find)
   - [Replacing substring](#replace)
   - [Joining](#Joining)
   - [Formatting](#Formatting)
   - [f-strings](#f-strings)

## String methods <a name="string-methods"></a> 

Oftentimes, we will work with text data, and not only to perform Text Analytics. For example, imagine you have the following table:

|Name|City | Gender|Yearly Electricity consumption $(kWh)$|Green premium ($CHF/kWh$)|Green electricity mix|
|:-------|:----------|:----------|:----------|:----------|:----------|
|Valjean| Lausanne | M | 3600 | 1.00 | No |
|Javert | Geneva | M | 5400 | 0.10 | Yes |
|Fantine | Bern | F | NA | 0.30 | No |
|Cosette | Zurich | F | 7200 | 0.75 | Yes |

The table contains, for several individuals, their name, city of residence, gender, yearly electricity consumption, green premium (i.e., how much extra they are willing to pay for electricity produced from renewable energy), and whether or not they already subscribe to the green electricity option. By the way, in Switzerland, the default option is the green electricity mix, so unless you ask to unsubscribe, you get electricity produced from renewable sources (generally hydropower). You can also ask for a more expensive electricity mix, made of local renewable energy, such as photovoltaics and wind turbines.

Back to our table. We notice that the table is mixing text and numbers. Several variables in a text format might be useful for our analysis and should be converted to numbers. For instance, we might want to express the variable "Green electricity mix" as: 1 if the individual subscribed ('Yes'), 0 if not ('No').

In addition, we notice that even a numeric column can contain text: the electricity consumption of Fantine is 'NA', i.e., 'Not Available'. We could discard Fantine's observation from the analysis, which is probably what you should do in this case. But for the sake of the argument, we could also infer a value and replace 'NA' with a number, for instance by taking the mean of the sample, using the closest observation, or even generating a random number. As a good practice, **you should always document all the modifications you make to the original data, detailing your assumptions and the methods used**.
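As a rough illustration of the mean-replacement idea, here is a minimal sketch. It assumes the consumption column was read as a list of strings (the variable names are made up for this example), and it uses `isdigit()` from the table of methods in this section to detect which entries are numbers:

```python
# Values copied from the table above; 'NA' marks the missing observation.
raw_consumption = ['3600', '5400', 'NA', '7200']

# Step 1: convert the entries that contain only digits to integers.
known = []
for value in raw_consumption:
    if value.isdigit():
        known.append(int(value))

# Step 2: mean of the available observations: (3600 + 5400 + 7200) / 3
mean_consumption = sum(known) / len(known)

# Step 3: rebuild the column, replacing 'NA' by the mean.
imputed = []
for value in raw_consumption:
    if value.isdigit():
        imputed.append(float(value))
    else:
        imputed.append(mean_consumption)

print(mean_consumption)  # 5400.0
print(imputed)           # [3600.0, 5400.0, 5400.0, 7200.0]
```

Whatever rule you choose, the imputed value is an assumption, not an observation, so it should be reported as such.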

So, the point is, we will often need to manipulate text data, i.e., strings in Python. As was the case for lists and dictionaries, there exist many string methods. Here are some of the common ones:

|Method|Description|
|:-------|:----------|
|`s.capitalize()` | Converts the first character to upper case|
|`s.casefold()` | Converts the string into a case-folded form for caseless comparison (similar to `lower()`, but more aggressive) |
|`s.count(value)` | Returns the number of times a specified value occurs in a string|
|`s.endswith(value)` |Returns True if the string ends with the specified value|
|`s.find(value)` | Searches the string for a specified value and returns the index of its first occurrence, or `-1` if it is not found|
|`s.format()` | Formats specified values in a string|
|`s.index(value)` | Same as `find()`, but raises a `ValueError` if the value is not found|
|`s.isalnum()` | Returns True if all characters in the string are alphanumeric|
|`s.isalpha()` | Returns True if all characters in the string are in the alphabet|
|`s.isdigit()` | Returns True if all characters in the string are digits|
|`s.islower()`|Returns True if all characters in the string are lower case|
|`s.isnumeric()` | Returns True if all characters in the string are numeric|
|`s.isupper()`|Returns True if all characters in the string are upper case|
|`s.join(iterable)`|Concatenates the strings of an iterable into one string, using `s` as the separator|
|`s.lower()`|Converts a string into lower case|
|`str.maketrans()`|Returns a translation table to be used with `translate()`|
|`s.replace(old_value, new_value)`|Returns a string where a specified value is replaced with a specified value |
|`s.rfind(value)`|Searches the string for a specified value and returns the index of its last occurrence, or `-1` if it is not found |
|`s.rindex(value)`|Same as `rfind()`, but raises a `ValueError` if the value is not found |
|`s.rsplit(separator)`|Splits the string at the specified separator, starting from the right, and returns a list |
|`s.split(separator)`|Splits the string at the specified separator, and returns a list |
|`s.splitlines()`|Splits the string at line breaks and returns a list |
|`s.startswith(value)`|Returns True if the string starts with the specified value |
|`s.swapcase()`|Swaps cases, lower case becomes upper case and vice versa |
|`s.title()`|Converts the first character of each word to upper case|
|`s.translate()`|Returns a translated string using a dictionary or mapping table|
|`s.upper()`|Converts a string into upper case|

We will explore some of these string processing tools below. You can find a complete list of string methods in the [Python documentation](https://docs.python.org/3/library/stdtypes.html#string-methods).
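To get a feel for a few methods from the table that are not covered in detail below, here is a small illustrative sketch. The sample strings are only examples borrowed from the table at the top of this section, and the variable names are ours:

```python
city = 'lausanne'
record = 'Valjean,Lausanne,M'

capitalized = city.capitalize()   # first character in upper case
starts = city.startswith('lau')   # does the string start with 'lau'?
is_number = '3600'.isdigit()      # only digits in the string
is_missing = 'NA'.isdigit()       # letters are not digits
fields = record.split(',')        # cut the string at each comma

print(capitalized)  # Lausanne
print(starts)       # True
print(is_number)    # True
print(is_missing)   # False
print(fields)       # ['Valjean', 'Lausanne', 'M']
```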



### Indexing and slicing of strings <a name="string-index"></a> 

As previously mentioned, strings are sequences of characters, in the same way that lists and tuples are sequences of elements. Hence, indexing and slicing work the same as they do for lists and tuples! Similarly, built-in functions such as `len()` also apply to strings!

In [175]:
my_str = 'The Green Premium is the additional cost of choosing a clean technology over one that emits a greater amount of greenhouse gases.'

print(len(my_str))
print(my_str[97])
print(my_str[:18])
print(my_str[::-1])

129
a
The Green Premium 
.sesag esuohneerg fo tnuoma retaerg a stime taht eno revo ygolonhcet naelc a gnisoohc fo tsoc lanoitidda eht si muimerP neerG ehT


### Counting <a name="Counting"></a>

We can use `count()` to calculate the number of occurrences of a character:

In [183]:
my_str.count('a')

9


The method can also count the number of times a substring appears in a string:

In [185]:
my_str.count('ee')

2

However, `count()` only counts non-overlapping occurrences. For instance, in a string made of six 'A' characters, `count('AA')` returns only 3:

In [187]:
'AAAAAA'.count('AA')

3
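If you ever need to count overlapping occurrences, there is no built-in string method for it, but a short loop does the job. The sketch below is one possible approach; the helper `count_overlapping` is a hypothetical name for this example, and it relies on `find()`, which is covered in the section on finding substrings:

```python
def count_overlapping(text, sub):
    """Count the occurrences of sub in text, allowing overlaps."""
    count = 0
    start = text.find(sub)  # index of the first match, or -1
    while start != -1:
        count += 1
        # Search again, starting one character after the previous match.
        start = text.find(sub, start + 1)
    return count

print('AAAAAA'.count('AA'))               # 3
print(count_overlapping('AAAAAA', 'AA'))  # 5
```

The difference comes from where the search resumes: `count()` continues after the end of each match, while the loop continues one character after its start.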

In addition, `count()` and other similar methods are case-sensitive:

In [188]:
'AAaaa'.count('A')

2

### Case conversion <a name="case-conversion"></a> 

To avoid issues with case sensitivity, you can convert all the characters of your string to lower case:

In [189]:
my_str.lower()

'the green premium is the additional cost of choosing a clean technology over one that emits a greater amount of greenhouse gases.'

Alternatively, to upper case:

In [190]:
my_str.upper()

'THE GREEN PREMIUM IS THE ADDITIONAL COST OF CHOOSING A CLEAN TECHNOLOGY OVER ONE THAT EMITS A GREATER AMOUNT OF GREENHOUSE GASES.'

Note that `lower()` and `upper()` did not modify our original string:

In [192]:
my_str

'The Green Premium is the additional cost of choosing a clean technology over one that emits a greater amount of greenhouse gases.'

Can you guess why? It's because strings are immutable. This is very convenient, as it prevents accidental modifications. But if you want to work with a modified string, you need to store it in a variable:

In [205]:
my_str_low = my_str.lower()
my_str_low

'the green premium is the additional cost of choosing a clean technology over one that emits a greater amount of greenhouse gases.'
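Combining case conversion with `count()` gives a case-insensitive count. The short sketch below redefines the example string so that it can run on its own; the variable names `exact` and `any_case` are ours:

```python
my_str = 'The Green Premium is the additional cost of choosing a clean technology over one that emits a greater amount of greenhouse gases.'

# Case-sensitive: 'The' at the start of the sentence is not counted.
exact = my_str.count('the')

# Case-insensitive: convert to lower case first, then count.
any_case = my_str.lower().count('the')

print(exact)     # 1
print(any_case)  # 2
print('AAaaa'.lower().count('a'))  # 5
```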

### Finding substring  <a name="find"></a> 

First, let's check whether the word `'technology'` appears in our string. We use the `in` operator:

In [206]:
'technology' in my_str_low

True

We got `True`, so indeed the sequence of characters forming the word 'technology' does appear in our string. Let's find where! The `find()` method gives the index where the substring *first* appears:

In [207]:
my_str_low.find('technology')

61

What if a substring is not in the string?

In [208]:
my_str_low.find('choice')

-1

We got `-1`. Beware! It is not the index `-1`! `find()` always returns a non-negative index (`0` or greater) if it finds the substring, and returns `-1` otherwise.
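To see why this matters, here is a small sketch of the pitfall and of two ways around it. The short string and the variable names are just for illustration. The second option uses `index()`, which raises an error instead of returning `-1`, together with a `try`/`except` block that you may not have seen yet:

```python
text = 'the green premium'

pos = text.find('choice')
print(pos)        # -1: 'choice' was not found
print(text[pos])  # m: used as an index, -1 silently picks the LAST character

# Safer option 1: test the value returned by find() before using it.
if pos == -1:
    message = 'not found'
else:
    message = 'found at index ' + str(pos)
print(message)    # not found

# Safer option 2: index() raises a ValueError instead of returning -1.
try:
    text.index('choice')
    raised = False
except ValueError:
    raised = True
print(raised)     # True
```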

Now what happens if a substring appears several times? As mentioned above, `find()` will return the index where the substring *first* appears. If we want the last instance, we can use `rfind()` (i.e., right find):

In [210]:
print(my_str_low.find('the'))
print(my_str_low.rfind('the'))

0
21


When the substring appears only once, `find()` and `rfind()` return the same index:

In [238]:
print(my_str_low.find('technology'))
print(my_str_low.rfind('technology'))

61
61


### Replacing substring <a name="replace"></a> 

Here is a very convenient method: `replace()` will replace a substring by another. For instance, say we want to replace `'additional'` by `'extra'`:

In [211]:
my_str_low.replace('additional', 'extra')

'the green premium is the extra cost of choosing a clean technology over one that emits a greater amount of greenhouse gases.'

But remember, strings are immutable, so we did not actually change our original string!

In [212]:
my_str_low

'the green premium is the additional cost of choosing a clean technology over one that emits a greater amount of greenhouse gases.'

`replace()` replaces all occurrences of the substring (and not only the first one):

In [245]:
my_str_low.replace('the','')

' green premium is  additional cost of choosing a clean technology over one that emits a greater amount of greenhouse gases.'

`replace()` will thus come in handy to replace missing/error values in our data, remove punctuation, or, for instance, convert Yes/No statements into 1 and 0 values.
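Here is a minimal sketch of these uses. The list `mix` copies the "Green electricity mix" column from the table at the top of this section, while the sentence and the variable names are made up for the example:

```python
# Convert Yes/No statements into 1 and 0 values.
mix = ['No', 'Yes', 'No', 'Yes']
mix_binary = []
for answer in mix:
    as_text = answer.replace('Yes', '1').replace('No', '0')
    mix_binary.append(int(as_text))
print(mix_binary)  # [0, 1, 0, 1]

# Remove punctuation by replacing it with an empty string.
sentence = 'Yes, yes, and yes again!'
no_punct = sentence.replace(',', '').replace('!', '')
print(no_punct)    # Yes yes and yes again

# An optional third argument limits the number of replacements.
first_only = 'the green the clean'.replace('the', 'a', 1)
print(first_only)  # a green the clean
```

Since `replace()` returns a new string, calls can be chained as above: each call works on the result of the previous one.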

### Joining <a name="Joining"></a>

The `join()` method concatenates the elements of an iterable (e.g., a list or a tuple) of strings into one string:

In [213]:
word_tuple = ('circular', 'economy', 'premium')

' '.join(word_tuple)

'circular economy premium'

Note the syntax: the string before the dot, here `' '`, specifies what goes between the elements of the iterable in the resulting string, e.g., here, a space. But you are free to use any separator you wish:

In [214]:
' ... '.join(word_tuple)

'circular ... economy ... premium'
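Two practical points are worth a quick sketch: `join()` only accepts strings, so numbers must be converted first, and `split()` performs the reverse operation. The values below come from the consumption column of the table at the top of this section; the variable names and the `';'` separator are arbitrary choices for the example:

```python
values = [3600, 5400, 7200]

# join() raises a TypeError on numbers, so convert each element with str().
as_strings = []
for v in values:
    as_strings.append(str(v))

line = ';'.join(as_strings)
print(line)   # 3600;5400;7200

# split() goes the other way: from one string back to a list of strings.
parts = line.split(';')
print(parts)  # ['3600', '5400', '7200']
```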

By the way, if you are wondering what the circular economy premium is, just ask Edoardo, we are working on the topic with E4S :)

### Formatting <a name="Formatting"></a>

Remember when we wanted to combine strings and we used the operator `+`? Well, there is another much more powerful way using the `format()` method. Here is how it looks:

In [254]:
obs_description = ''' 
Our last observation is {name}. 
She lives in {city}.'''.format(name = 'Cosette', city = 'Zurich')

print(obs_description)

 
Our last observation is Cosette. 
She lives in Zurich.


See what happened? We applied the `format()` method to our string. As arguments, the `format()` method takes keyword arguments (kwargs), here `name = 'Cosette'` and `city = 'Zurich'`. Within the string, the names of the kwargs are given in curly braces. The `format()` method inserts the values at the places delimited by the braces.

That's nice, isn't it? But can we reach the next level and combine strings and numbers into a string, without relying on the function `str()`? Yes, we can! By using **string conversion** codes, we can specify how a number should be represented in a string. The table below shows some of the most common conversions. A complete list is available in the [Python documentation](https://docs.python.org/3/library/string.html#formatspec).

|conversion|description|
|:----------:|-----------|
|`d`| integer|
|`04d`| integer padded with leading zeros to a width of at least four characters|
|`f`| float, default to six digits after decimal|
|`.8f`| float with 8 digits after the decimal|
|`e`| scientific notation, default to six digits after decimal|
|`.16e`| scientific notation with 16 digits after the decimal|
|`s`| display as a string|

Let's try with some examples:

In [228]:
print('Her electricity consumption is {n:d} kWh per year.'.format(n=7200))
print('She is willing to pay an extra {c:f} swiss francs per kWh to get green electricity.'.format(c=0.75))
print('She is willing to pay an extra {c:.3f} swiss francs per kWh to get green electricity.'.format(c=0.75))
print('She is willing to pay an extra {c:.2e} swiss francs per kWh to get green electricity.'.format(c=0.75))
print('She answered {answer:s} to the question of whether she already subscribed to the Green Electricity Mix.'.format(answer='Yes'))

Her electricity consumption is 7200 kWh per year.
She is willing to pay an extra 0.750000 swiss francs per kWh to get green electricity.
She is willing to pay an extra 0.750 swiss francs per kWh to get green electricity.
She is willing to pay an extra 7.50e-01 swiss francs per kWh to get green electricity.
She answered Yes to the question of whether she already subscribed to the Green Electricity Mix.


Do you understand the syntax? It is the same as before: kwargs as arguments of the `format()` method and kwargs inside braces in the string. However, now we also specify the conversion, i.e., how the kwarg should appear in the string. We do so by including in the braces, after our kwarg, a colon `:` followed by the string conversion code. For instance, in the first string, we have `{n:d}` inside the braces: `n` is our kwarg, and `d` is the string conversion for integers. The second string is similar, except that this time we use `f` to specify that we convert a float. Note that by default, a float will appear with six digits after the decimal point. If we want to adjust the number of decimals, we can: in the third string we have `.3f`, meaning we only want three decimals.
Quite neat, isn't it? Then, wait, it gets even better!
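The remaining codes from the conversion table work the same way. The sketch below is only illustrative (the numbers and variable names are ours) and also shows that braces can be left empty, in which case `format()` fills them in order with its positional arguments:

```python
padded = '{n:04d}'.format(n=42)        # pad with zeros up to 4 characters
precise = '{c:.8f}'.format(c=0.75)     # 8 digits after the decimal point
scientific = '{n:e}'.format(n=7200.0)  # scientific notation, 6 decimals
positional = '{} lives in {}.'.format('Cosette', 'Zurich')

print(padded)      # 0042
print(precise)     # 0.75000000
print(scientific)  # 7.200000e+03
print(positional)  # Cosette lives in Zurich.
```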

### f-strings <a name="f-strings"></a>

There is an even faster and more efficient way than the `format()` method! **f-strings** (available since Python 3.6) let us insert variables directly into strings. Even better, the syntax is quite simple: we prefix our string with `f` or `F`, and then put the name of the variable we want to insert between braces. We don't even need to specify the conversion, unless we want to change the default, for example the number of decimals displayed for floats. In that case, we use the same conversion codes as for the `format()` method. Here are some illustrations:

In [236]:
name = 'Cosette'
city = 'Zurich'
consumption = 7200
wtp = 0.75

print(f'Our last observation is {name}.')
print(f'She lives in {city}.')
print(f'Her electricity consumption is {consumption} kWh per year.')
print(f'She is willing to pay an extra {wtp:.3f} swiss francs per kWh to get green electricity.')

Our last observation is Cosette.
She lives in Zurich.
Her electricity consumption is 7200 kWh per year.
She is willing to pay an extra 0.750 swiss francs per kWh to get green electricity.
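The braces of an f-string are not limited to variable names: they can hold any Python expression, such as a calculation or a method call. The sketch below reuses Cosette's values; the yearly cost computed here is only an illustration of the syntax, not a figure from the data:

```python
name = 'Cosette'
consumption = 7200
wtp = 0.75

# An arithmetic expression, formatted with two decimals.
cost_line = f'At that premium, she would pay an extra {consumption * wtp:.2f} CHF per year.'

# A method call inside the braces.
shout_line = f'Our last observation is {name.upper()}.'

print(cost_line)   # At that premium, she would pay an extra 5400.00 CHF per year.
print(shout_line)  # Our last observation is COSETTE.
```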


Alright, this concludes our introduction to Python. Make sure to practise, it is the only way to learn programming. But for now, congrats if you made it this far, you deserve a treat!

<img src='https://miro.medium.com/v2/resize:fit:640/format:webp/0*hIcd2XXtpTwXTYgl.png' width="400">