#### Data manipulation steps
    - Data preparation
      - Loading
      - Assembling
        - Merging
        - Concatenating
        - Combining
      - Reshaping (pivoting)
      - Removing
    - Data transformation
    - Data aggregation

#### Data merging
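    - merge() : by default performs an inner join, so the returned dataframe contains only the rows whose key value (here the ID) appears in both dataframes. The key column appears once, followed by the remaining columns of the first and then the second dataframe.
    - if the dataframes share more than one column name, pass "on='column_name'" to choose the column on which the merge is performed.
      - otherwise merge() uses every shared column as the key, and the result can be empty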
    - if the key columns do not have the same name in the two dataframes
      - use the "left_on='column_name'" and "right_on='column_name'" options
      - pd.merge(frame1, frame2, left_on='id', right_on='sid')
    - join() : a dataframe method that is more convenient when you want to merge by index. It can also combine several dataframe objects that have the same index and no overlapping column names.
      - ex : frame1.join(frame2) 
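
The effect of the `on` argument can be sketched with two small frames that share two column names. This is a minimal illustration: the names `left` and `right` and their values are assumptions made for this example, not data from the cells below.

```python
import pandas as pd

# Illustrative frames that share two column names: 'id' and 'brand'.
left = pd.DataFrame({'id': ['ball', 'pencil', 'pen'],
                     'brand': ['OMG', 'ABC', 'ABC']})
right = pd.DataFrame({'id': ['pencil', 'ball', 'pen'],
                      'brand': ['OMG', 'ABC', 'POD']})

# Without `on`, every shared column is part of the key, so a row must
# match on both 'id' and 'brand'. No pair does here, so the result is empty.
merged_all_keys = pd.merge(left, right)

# With on='id', only 'id' is the key; the overlapping 'brand' columns
# are kept side by side with the default suffixes _x and _y.
merged_on_id = pd.merge(left, right, on='id')

print(len(merged_all_keys))          # 0
print(list(merged_on_id.columns))    # ['id', 'brand_x', 'brand_y']
```

Passing `on` explicitly is usually the safer habit, since it keeps the key independent of whichever other columns the two frames happen to share.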

In [30]:
import numpy as np
import pandas as pd

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],                
                      'price': [12.33,11.44,33.21,13.23,33.62]})
frame2 = pd.DataFrame( {'id':['pencil','pencil','ball','pen'],                
                      'color': ['white','red','red','black']})   
print("Merged on the common column 'id' = \n",pd.merge(frame1,frame2))

Merged on the common column 'id' = 
        id  price  color
0    ball  12.33    red
1  pencil  11.44  white
2  pencil  11.44    red
3     pen  33.21  black


In [31]:
frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],                
                      'color': ['white','red','red','black','green'],                
                      'brand': ['OMG','ABC','ABC','POD','POD']})
frame2 = pd.DataFrame( {'id':['pencil','pencil','ball','pen'],                
                      'brand': ['OMG','POD','ABC','POD']})  
pd.merge(frame1,frame2,on='brand') 

      id_x  color brand    id_y
0     ball  white   OMG  pencil
1   pencil    red   ABC    ball
2      pen    red   ABC    ball
3      mug  black   POD  pencil
4      mug  black   POD     pen
5  ashtray  green   POD  pencil
6  ashtray  green   POD     pen


In [32]:
frame2.columns = ['sid','brand']
pd.merge(frame1, frame2, left_on='id', right_on='sid')

       id  color brand_x     sid brand_y
0    ball  white     OMG    ball     ABC
1  pencil    red     ABC  pencil     OMG
2  pencil    red     ABC  pencil     POD
3     pen    red     ABC     pen     POD


In [33]:
frame2.columns = ['id_2','brand_2']
frame1.join(frame2)

        id  color brand    id_2 brand_2
0     ball  white   OMG  pencil     OMG
1   pencil    red   ABC  pencil     POD
2      pen    red   ABC    ball     ABC
3      mug  black   POD     pen     POD
4  ashtray  green   POD     NaN     NaN
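
The cell above renames the columns of `frame2` before calling `join()` because `join()` does not accept frames whose column names overlap. A minimal sketch of the alternative, passing `lsuffix` and `rsuffix`, follows; the frames `left` and `right` and the suffix values are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

left = pd.DataFrame({'id': ['ball', 'pencil', 'pen'],
                     'brand': ['OMG', 'ABC', 'ABC']})
right = pd.DataFrame({'id': ['pencil', 'ball'],
                      'brand': ['OMG', 'POD']})

# Overlapping column names with no suffixes make join() raise ValueError.
try:
    left.join(right)
    overlap_error = False
except ValueError:
    overlap_error = True

# Suffixes disambiguate the overlapping names; rows are aligned on the
# index, and index labels missing from `right` are filled with NaN.
joined = left.join(right, lsuffix='_1', rsuffix='_2')

print(overlap_error)                     # True
print(list(joined.columns))              # ['id_1', 'brand_1', 'id_2', 'brand_2']
print(joined['id_2'].isna().tolist())    # [False, False, True]
```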


#### String manipulation
    - text.split(',')       # splits the string at each comma
    - s.strip()             # strips leading and trailing whitespace (example in code)

    - 'join_character'.join(list_of_strings)
      - Ex : ';'.join(strings)
    - text.index('Boston') and text.find('Boston')    # both return the position where the substring starts
    - If the substring is not found
      - index() raises a ValueError
      - find() returns -1
    - to replace text use "text.replace('Avenue','Street')"
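
The not-found behaviour and `replace()` are not exercised in the cells below, so here is a minimal sketch. The substring `'Chicago'` is an illustrative value chosen because it does not occur in the text.

```python
text = '6 Bolton Avenue , Boston'

# find() signals a missing substring with -1.
missing_with_find = text.find('Chicago')

# index() raises ValueError instead, so guard it with try/except.
try:
    missing_with_index = text.index('Chicago')
except ValueError:
    missing_with_index = None

# replace() returns a new string; the original is left unchanged.
replaced = text.replace('Avenue', 'Street')

print(missing_with_find)     # -1
print(missing_with_index)    # None
print(replaced)              # 6 Bolton Street , Boston
```

`find()` suits cases where a missing substring is an expected outcome, while `index()` suits cases where it should be treated as an error.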

In [34]:
text = '6 Bolton Avenue , Boston'
splitted_text = text.split(',')

print("splitted_text = ",splitted_text)

# split() leaves a space after 'Avenue' and before 'Boston'; strip() removes them
splitted_text_without_space = [s.strip() for s in text.split(',')]
print("splitted_text_without_space = ",splitted_text_without_space)

strings = ['A+','A','A-','B','BB','BBB','C+'] 
';'.join(strings)

splitted_text =  ['6 Bolton Avenue ', ' Boston']
splitted_text_without_space =  ['6 Bolton Avenue', 'Boston']


'A+;A;A-;B;BB;BBB;C+'

In [35]:
print("text at index using index = ",text.index('Boston'))
print("text at index using find = ",text.find('Boston'))

text at index using index =  18
text at index using find =  18


#### Regular Expressions
    - Regular expressions provide a very flexible way to search and match string patterns within text. A single expression is generically called a regex.
      - first "import re"
    - for splitting use : "re.split(r'\s+', text)"     # splits on one or more whitespace characters
    - findall()     # returns every substring that matches the pattern
      - ex : re.findall(r'A\w+',text)        # matches 'A' followed by one or more word characters
      - re.findall(r'[Aa]\w+',text)          # same, starting with 'A' or 'a' (write [Aa], not [A,a]: a comma inside the brackets is matched literally)
    - search()      # returns a match object describing where the pattern occurs
      - returns only the first match
      - re.search(r'[Aa]\w+',text)
      - use 'start()' and 'end()' on the match object to get the start and end index
    - match() function performs matching only at the beginning of the string;
      - if the pattern does not match starting at the first character, it does not search any further in the string.
      - If there is no match, it returns None instead of a match object.
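
The cells below do not show `match()`, so the difference from `search()` can be sketched as follows. This is a minimal illustration; the pattern `T\w+` is an assumption chosen only because it fits the start of the sample text.

```python
import re

text = 'This is my address: 16 Bolton Avenue, Boston'

# match() anchors at position 0: 'T' is not in [Aa], so there is no match.
no_match = re.match(r'[Aa]\w+', text)

# A pattern that fits the start of the string does match.
start_match = re.match(r'T\w+', text)

# search() scans the whole string and returns the first hit.
first_hit = re.search(r'[Aa]\w+', text)

print(no_match)               # None
print(start_match.group())    # This
print(first_hit.group())      # address
```

Because `match()` can return None, check the result before calling `group()`, `start()` or `end()` on it.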

In [36]:
import re
re.split(r'\s+', text)

['6', 'Bolton', 'Avenue', ',', 'Boston']

In [37]:
# alternative: compile the pattern once, then reuse it to split
regex = re.compile(r'\s+')
regex.split(text)

['6', 'Bolton', 'Avenue', ',', 'Boston']

In [38]:
# find all matching substrings
text = 'This is my address: 16 Bolton Avenue, Boston'
string_with_A = re.findall(r'A\w+',text)
# [Aa] matches 'A' or 'a'; a comma inside the brackets would match a literal ','
string_with_A_or_a = re.findall(r'[Aa]\w+',text)

print("string_with_A = ",string_with_A)
print("string_with_A_or_a = ",string_with_A_or_a)

string_with_A =  ['Avenue']
string_with_A_or_a =  ['address', 'Avenue']


In [39]:
search_result = re.search(r'[Aa]\w+',text)

print("search_result = ",search_result)

start_index = search_result.start()
end_index = search_result.end()

returned_text = text[start_index : end_index]

# alternatively, slice in one step (search_result.group() returns the same text)
# text[search_result.start():search_result.end()]

print("start_index= ",start_index, ',',"end_index = ",end_index)
print("returned_text = ",returned_text)

search_result =  <re.Match object; span=(11, 18), match='address'>
start_index=  11 , end_index =  18
returned_text =  address
