### String, String Methods, and String Manipulation

The following notebook partially covers strings in Python, string manipulation in data science, and a few related examples.

A __string__ is a sequence of characters. Any character can be accessed by its index. Indexing starts at 0 when counting from the beginning of the string, or at -1 when counting from the end. The built-in function __len__ returns the number of characters in a string. Unlike an index, this count is not zero-based, so the last valid index is always one less than the length.

In [1]:
# Creating a string
text = 'Some collection of words'

# Assigning the number of characters in the given string to a variable
total_char = len(text)

# Printing the total number of characters
print('Number of characters = {}'.format(total_char))

Number of characters = 24
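
As a small illustration of the relation between indices and length, the sketch below accesses the first and last characters of the same example string. The variable names are illustrative only.

```python
text = 'Some collection of words'

first_char = text[0]             # index 0 is the first character
last_char = text[-1]             # index -1 is the last character
same_last = text[len(text) - 1]  # the largest valid non-negative index

print(first_char, last_char, same_last)

# len(text) itself is one past the end, so using it as an index fails
try:
    text[len(text)]
except IndexError as error:
    print('IndexError:', error)
```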



Let's look at the indexing of the string. As mentioned above, __indexing is zero-based__: the string contains 24 characters, so the valid indices run from 0 to 23. The cell below demonstrates this by looping over the characters and printing each one with its index underneath.

__{0:3}__ pads each printed character to a minimum width of 3. By default, the __print()__ function ends its output with a newline; passing __end=""__ keeps the output on the same line.

In [2]:
# Looping through the characters in the given string
for letter in text:
    print('{0:3}'.format(letter), end="")

# Switching to the next line
print()

# Integers are right-aligned by default. We can use '<' to align to the left
for i in range(total_char):
    print('{0:<3d}'.format(i), end="")

S  o  m  e     c  o  l  l  e  c  t  i  o  n     o  f     w  o  r  d  s  
0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 

The same output can be produced with the older __%__ formatting operator.

In [3]:
for letter in text:
    print('%-3s' % letter, end="")
    
print()

for i in range(len(text)):
    print('%-3d' % i, end="")

S  o  m  e     c  o  l  l  e  c  t  i  o  n     o  f     w  o  r  d  s  
0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 
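
For comparison, the same two rows can also be built with f-strings (available from Python 3.6). This is only a sketch of a third formatting style; the names letters_row and indices_row are illustrative and do not appear elsewhere in the notebook.

```python
text = 'Some collection of words'

# Build each row as one string, padding every field to a width of 3
letters_row = ''.join(f'{letter:3}' for letter in text)
indices_row = ''.join(f'{i:<3d}' for i in range(len(text)))

print(letters_row)
print(indices_row)
```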

<br></br>
Strings can be sliced. A slice runs from the start index up to, but not including, the end index. A step of -1 returns the string reversed:

In [4]:
text[0:10]

'Some colle'

In [5]:
text[::-1]

'sdrow fo noitcelloc emoS'
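
A few more slicing forms may be useful here. The sketch below reuses the same example string; omitted bounds default to the start or end of the string, and an end index beyond the length is clipped rather than raising an error.

```python
text = 'Some collection of words'

first_word = text[:4]        # omitted start defaults to 0
last_word = text[-5:]        # negative start counts from the end
every_other = text[::2]      # a step of 2 keeps every second character
clipped = text[20:100]       # an end index past the length is clipped
reversed_text = text[::-1]   # a step of -1 walks the string backwards

print(first_word, last_word, clipped)
```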

More on the topic:
* Input and Output from the [Python tutorials](https://docs.python.org/3/tutorial/inputoutput.html)
* Various [String Methods](https://docs.python.org/3/library/stdtypes.html#string-methods)
* Formatting the output [Tutorial](https://www.python-course.eu/python3_formatted_output.php)

#### Example
Let's look at some string methods in practice. The Kaggle [Titanic](https://www.kaggle.com/c/titanic) challenge provides a data set that is split into a training set and a test set. One of the columns contains the names of the passengers. At first glance this column may look unusable, but with feature engineering we can extract additional data from it. Looking closely, the names follow the format Last Name, Title. First Name (sometimes followed by a middle name), e.g. Braund, Mr. Owen Harris. The data is stored as strings, so we can extract the title from each name and use it as a category. One possible solution is shown below:

In [6]:
# Importing pandas library
import pandas as pd

# Loading the given data set to a pandas DataFrame
df = pd.read_csv('data/train.csv')
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


Each name starts with a last name, followed by a comma and a space. The title is followed by a period. The __find__ method returns the index of the first occurrence of the substring we are looking for, in this case the comma (,) and the period (.). Let's look at a single example:

In [7]:
# Creating a string
name = 'Braund, Mr. Owen Harris'

# Detecting the indices of the required characters
name.find(','), name.find('.')

(6, 10)

The __find__ method shows that the comma is at index 6 and the period is at index 10 for the given string. The title (in this case Mr) starts two positions after the comma, so it can be obtained with the slice [8:10].

In [8]:
# Slicing the string
name[name.find(',') + 2 : name.find('.')]

'Mr'
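
One caveat is that __find__ returns -1 when the substring is absent, so for a name without a comma or period the slice above silently becomes name[1:-1] instead of failing. The following is a minimal sketch of a guarded version; extract_title is a hypothetical helper and not part of the original notebook.

```python
def extract_title(name):
    """Return the text between the comma and the next period, or None if either is missing."""
    comma = name.find(',')
    period = name.find('.', comma + 1)  # look for the period only after the comma
    if comma == -1 or period == -1:
        return None
    return name[comma + 1:period].strip()


print(extract_title('Braund, Mr. Owen Harris'))
print(extract_title('No title here'))
```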

Once the general pattern is known, a __for loop__ can slice each name in the column to extract the title, and the titles can then be saved to the DataFrame.

In [9]:
# Creating an empty list to store the titles
array = []

# Looping through the DataFrame
for name in df['Name']:
    # Extracting the title by slicing the string
    title = name[name.find(',') + 2 : name.find('.')]
    # Appending the title to the list
    array.append(title)

# Adding the list to the DataFrame as the column 'Title'
df['Title'] = array

# Checking the first 5 titles
df['Title'][:5]

0      Mr
1     Mrs
2    Miss
3     Mrs
4      Mr
Name: Title, dtype: object

The same result can be achieved with the __split__ and __strip__ methods.

In [10]:
# Creating a string
name = 'Braund, Mr. Owen Harris'

# Applying split and strip methods
name.split(',')[1].split('.')[0].strip()

'Mr'

Let's go through this approach step by step:

1. The name string is split at the comma into a list of two strings.
2. The element at index 1 (the second element) of that list is selected and split again, this time at the period, into another list of two strings.
3. Finally, the __strip__ method removes the surrounding whitespace from the element at index 0 of that list.

The steps are demonstrated in the code below:

In [11]:
print('Step 1 - {}'.format(name.split(',')))
print('Step 2 - {}'.format(name.split(',')[1].split('.')))
print('Step 3 - {}'.format(name.split(',')[1].split('.')[0].strip()))

Step 1 - ['Braund', ' Mr. Owen Harris']
Step 2 - [' Mr', ' Owen Harris']
Step 3 - Mr
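
Note that indexing the result of __split__ with [1] raises an IndexError when the name contains no comma. As a possible alternative, the built-in __partition__ method always returns three parts, so a missing separator yields an empty string instead of an error. The helper name title_via_partition is hypothetical.

```python
def title_via_partition(name):
    """Extract the text between the first comma and the following period using str.partition."""
    _, _, after_comma = name.partition(',')        # everything after the first comma
    before_period, _, _ = after_comma.partition('.')  # everything before the next period
    return before_period.strip()


print(title_via_partition('Braund, Mr. Owen Harris'))
print(repr(title_via_partition('No separators')))
```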


A __lambda__ expression and the __Series.map()__ method of pandas can replace the for loop for cleaner code.

In [12]:
# Adding a new column to the DataFrame by mapping a lambda expression over the names
df['NewTitle'] = df['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())

# Checking the first 5 rows of the DataFrame
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | NewTitle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S | Mr | Mr |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Mrs | Mrs |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S | Miss | Miss |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | Mrs | Mrs |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S | Mr | Mr |
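
pandas also exposes vectorized string methods through the __.str__ accessor, which may be worth considering instead of a lambda. The sketch below assumes a small hand-made Series in place of the 'Name' column of train.csv, and the regular expression is one possible pattern rather than the approach used in this notebook. It also counts the titles, as a first step toward categorizing them.

```python
import pandas as pd

# Hypothetical three-row sample standing in for df['Name']
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Allen, Mr. William Henry',
])

# Capture the text between the comma (plus optional spaces) and the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

# Count how often each title occurs
title_counts = titles.value_counts()
print(title_counts)
```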
