### Strings, String Methods, and String Manipulation

This notebook covers the basics of strings in Python, string manipulation in data science, and a few examples related to the topic.

__Strings__ are sequences of characters. Any character can be accessed by its index. Indexing starts at 0 from the beginning of the string, or at -1 when counting from the end. We can get the number of characters in a string by using the built-in function __len__. Unlike indexing, __len__ is not zero-based: it returns the actual count, so the last valid index is one less than the length.

In [1]:
# Creating a string
text = 'Some collection of words'

# Assigning the number of characters in the given string to a variable
total_char = len(text)

# Printing the total number of characters
print('Number of characters = {}'.format(total_char))

Number of characters = 24


<br></br>
Let's look at the indexing of the string. As already mentioned, __indexing is zero-based__: although the string contains 24 characters, the valid indices run from 0 to 23. The cell below demonstrates this by looping over the characters and printing the corresponding index under each one.

__{0:3}__ pads each printed character to a field width of 3. The __print()__ function appends a newline by default. By passing __end=""__ we can continue printing on the same line.
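
As a small illustration of the format specifiers described above, the snippet below shows how width and alignment change the result. The sample values `'a'` and `7` are arbitrary and not taken from the notebook.

```python
# Strings are left-aligned by default, so 'a' is followed by two spaces
padded_char = '{0:3}'.format('a')

# Integers are right-aligned by default, so 7 is preceded by two spaces
padded_right = '{0:3d}'.format(7)

# '<' forces left alignment for the integer
padded_left = '{0:<3d}'.format(7)

# repr() makes the padding visible in the printed output
print(repr(padded_char), repr(padded_right), repr(padded_left))

# end="" suppresses the newline, so both calls print on one line
print('ab', end="")
print('cd')
```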

In [2]:
# Looping through the characters in the given string
for letter in text:
    print('{0:3}'.format(letter), end="")

# Switching to the next line
print()

# Integers are right-aligned by default. We can use '<' to align to the left
for i in range(total_char):
    print('{0:<3d}'.format(i), end="")

S  o  m  e     c  o  l  l  e  c  t  i  o  n     o  f     w  o  r  d  s  
0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 
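
To make the index row above concrete, here is a minimal sketch of positive and negative indexing on the same string. The variable names `first`, `last`, `last_negative` and `first_negative` are introduced only for this illustration.

```python
text = 'Some collection of words'

# Positive indices run from 0 to len(text) - 1
first = text[0]
last = text[len(text) - 1]

# Negative indices count from the end: -1 is the last character
last_negative = text[-1]
first_negative = text[-len(text)]

print(first, last, last_negative, first_negative)

# An index equal to len(text) is out of range and raises IndexError
try:
    text[len(text)]
except IndexError as err:
    print('IndexError:', err)
```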

The same output can be produced with the older formatting style, using the __%__ operator.

In [3]:
for letter in text:
    print('%-3s' % letter, end="")
    
print()

# '-' left-aligns the index in a field of width 3
for i in range(len(text)):
    print('%-3s' % i, end="")

S  o  m  e     c  o  l  l  e  c  t  i  o  n     o  f     w  o  r  d  s  
0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 
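
A third option, not used in the notebook itself, is an f-string (available since Python 3.6). The sketch below is expected to build the same two rows, with the format specification written after the colon inside the braces. The names `chars_row` and `index_row` are illustrative.

```python
text = 'Some collection of words'

# Each character, left-aligned in a field of width 3
chars_row = ''.join(f'{letter:<3}' for letter in text)

# Each index, left-aligned in a field of width 3
index_row = ''.join(f'{i:<3d}' for i in range(len(text)))

print(chars_row)
print(index_row)
```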

<br></br>
Strings can be sliced. A slice is defined from a start index up to, but not including, an end index. A string can also be reversed by slicing with a step of -1:

In [4]:
text[0:10]

'Some colle'

In [5]:
text[::-1]

'sdrow fo noitcelloc emoS'
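
A few more slicing variations on the same string, as a short sketch: omitted bounds default to the start or end of the string, negative bounds count from the end, and bounds beyond the string length are clipped rather than raising an error. The variable names are illustrative only.

```python
text = 'Some collection of words'

first_word = text[:4]      # start omitted: from index 0 up to, not including, 4
second_word = text[5:15]   # characters at indices 5 through 14
last_word = text[-5:]      # end omitted: the last five characters
every_second = text[::2]   # a step of 2 keeps every second character
clipped = text[0:100]      # an out-of-range end index is clipped to len(text)

print(first_word)    # Some
print(second_word)   # collection
print(last_word)     # words
print(every_second)
print(clipped == text)
```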

More on the topic:
* Input and Output from the [Python tutorials](https://docs.python.org/3/tutorial/inputoutput.html)
* Various [String Methods](https://docs.python.org/3/library/stdtypes.html#string-methods)
* Formatting the output [Tutorial](https://www.python-course.eu/python3_formatted_output.php)

#### Example
Let's look at some string methods in practice. The Kaggle [Titanic](https://www.kaggle.com/c/titanic) challenge provides a data set that is split into a training set and a test set. One of the columns in the data set contains the names of the passengers. At first glance the column may look unusable, but with some feature engineering we can extract additional data from it. If we look closely, we can see that the names follow the format Last Name, Title. First Name (and sometimes a middle name), e.g. Braund, Mr. Owen Harris. The data is stored as strings, so we can extract the title from each name and categorize the passengers by it. One possible solution is shown below:

In [6]:
# Importing pandas library
import pandas as pd

# Loading the given data set to a pandas DataFrame
df = pd.read_csv('data/train.csv')
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


<br></br>
Each name starts with the last name, followed by a comma and a blank space. The title ends with a period. Using the __find__ method, we can locate the position of the characters we are looking for, in this case the comma (,) and the period (.):

In [None]:
name = 'Braund, Mr. Owen Harris'

name.find(','), name.find('.')

In [None]:
name[name.find(',') + 2 : name.find('.')]
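
One caveat worth keeping in mind: `str.find` returns -1 when the substring is missing, so the slice above would silently return the wrong text for a name without a comma or period. A minimal defensive sketch follows. The helper name `extract_title` is hypothetical and not part of the original notebook.

```python
def extract_title(name):
    """Return the text between the first comma and the next period, or None."""
    comma = name.find(',')
    if comma == -1:
        return None  # no comma: the name does not follow the expected format
    # Start searching for the period only after the comma
    period = name.find('.', comma)
    if period == -1:
        return None  # no period after the comma
    # strip() removes the blank space that follows the comma
    return name[comma + 1:period].strip()


print(extract_title('Braund, Mr. Owen Harris'))   # Mr
print(extract_title('Heikkinen, Miss. Laina'))    # Miss
print(extract_title('No title here'))             # None
```

For the example name `'Braund, Mr. Owen Harris'`, the comma is at index 6 and the period at index 10, so the slice covers indices 8 and 9.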

In [None]:
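# Creating an empty list to store the titles
titles = []
for name in df['Name']:
    # The title sits between ', ' and the first period
    title = name[name.find(',') + 2 : name.find('.')]
    titles.append(title)

# Adding the titles to the DataFrame as a new column
df['title'] = titles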

# Checking the first 5 titles
df['title'][:5]
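
As an alternative to the explicit loop, pandas offers vectorized string methods. The sketch below uses `Series.str.extract` with a regular expression and then `value_counts` to categorize the titles. It runs on a small hand-made Series (four names copied from the `df.head()` output) rather than the real data set, and the regular expression is one possible pattern, not the notebook's own method.

```python
import pandas as pd

# A tiny stand-in for the 'Name' column of the Titanic data set
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
    'Allen, Mr. William Henry',
])

# Capture everything between the comma (plus optional whitespace) and the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False)

# Count how often each title occurs
title_counts = titles.value_counts()
print(title_counts)
```

On the full data set the same two calls could be applied to `df['Name']` directly, assuming every name follows the Last Name, Title. First Name format.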