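In pandas, constructing a DataFrame correctly is essential: every later operation assumes the DataFrame already has the right shape.
While coding, you will very often need to combine two different DataFrames.
There are three commands for this.
1) concat - stacks two DataFrames on top of each other (pd.concat; pass axis=1 to place them side by side instead).
2) join - places two DataFrames side by side, aligning their rows on the index.
3) merge - merge() combines two DataFrames by matching the values of a key column that both of them contain.
pd.merge(df_left, df_right, how='inner', on=None)
With on=None, merge() uses the columns that the two DataFrames have in common as the key.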
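
To make the difference between concat and merge concrete, here is a minimal sketch. The DataFrames and the column names (id, name, price) are made up for illustration and are not part of the notebook below.

```python
import pandas as pd

# Hypothetical example data: both frames share the key column "id".
df_left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
df_right = pd.DataFrame({"id": [2, 3, 4], "price": [20, 30, 40]})

# concat: stack rows on top of each other and renumber the index.
stacked = pd.concat([df_left, df_left], ignore_index=True)

# merge with how="inner": keep only the ids present in both frames.
inner = pd.merge(df_left, df_right, how="inner", on="id")

# merge with how="outer": keep every id; missing cells become NaN.
outer = pd.merge(df_left, df_right, how="outer", on="id")

print(stacked)
print(inner)
print(outer)
```

The how argument also accepts 'left' and 'right', which keep every key from only one side.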

In [2]:
# This is a notebook for the coding implementation.
import pandas as pd 
import numpy as np 

dict_a = {
    'id1': [1, 2, 3, 4, 5],
    'name': ['a', 'b', 'c', 'd', 'e'],
    'price': [10, 20, 30, 40, 50]
}

dict_b = {
    'id2': [1, 2, 3, 4],
    'name': ['a', 'b', 'z', 'z'],
    'price': [10, 20, 100, 100]
}

df_a = pd.DataFrame(dict_a)
df_b = pd.DataFrame(dict_b)

print(dict_a)
print(dict_b)

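# join() aligns rows on the index; the suffixes disambiguate the shared 'name' and 'price' columns.
df_test = df_a.join(df_b, lsuffix='l_', rsuffix='r_')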
print(df_test)


{'id1': [1, 2, 3, 4, 5], 'name': ['a', 'b', 'c', 'd', 'e'], 'price': [10, 20, 30, 40, 50]}
{'id2': [1, 2, 3, 4], 'name': ['a', 'b', 'z', 'z'], 'price': [10, 20, 100, 100]}
   id1 namel_  pricel_  id2 namer_  pricer_
0    1      a       10  1.0      a     10.0
1    2      b       20  2.0      b     20.0
2    3      c       30  3.0      z    100.0
3    4      d       40  4.0      z    100.0
4    5      e       50  NaN    NaN      NaN
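
The join above lines rows up by position because both DataFrames use the default 0..n index. If the rows should be matched by id instead, one option is to move the id columns into the index first. The following is a sketch under that assumption; the suffixes '_l' and '_r' are my own choice, not the notebook's.

```python
import pandas as pd

df_a = pd.DataFrame({"id1": [1, 2, 3, 4, 5],
                     "name": ["a", "b", "c", "d", "e"],
                     "price": [10, 20, 30, 40, 50]})
df_b = pd.DataFrame({"id2": [1, 2, 3, 4],
                     "name": ["a", "b", "z", "z"],
                     "price": [10, 20, 100, 100]})

# Use the id columns as the index so join() matches rows by id, not by position.
joined = df_a.set_index("id1").join(df_b.set_index("id2"), lsuffix="_l", rsuffix="_r")
print(joined)
```

In this data the ids happen to coincide with the row order, so the matched rows are the same as before; the pattern matters when they do not.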


In [3]:
# This cell is about Series.apply / DataFrame.apply and lambda.
import numpy as np 
import pandas as pd

df = pd.DataFrame([[4,9],[1,4],[5,6]],columns = ['A','B'])
print(df)

# Suppose we want to apply a function to column A only.
def plus_one(x):
    x += 1
    return x

df['A'] = df['A'].apply(plus_one)
print(df)

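# Motivation: a lambda removes the need for a separately named helper function.
# Re-initialize the DataFrame.
df = pd.DataFrame([[4,9],[1,4],[5,6]],columns = ['A','B'])
# Note: DataFrame.apply runs the lambda on every column, so B is incremented as well.
df = df.apply(lambda a : a+1 )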
print(df)

   A  B
0  4  9
1  1  4
2  5  6
   A  B
0  5  9
1  2  4
2  6  6
   A   B
0  5  10
1  2   5
2  6   7
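
The last output shows that calling apply on the whole DataFrame changed column B too. A minimal sketch of two variants: restricting the lambda to column A, and passing axis=1 so that the lambda receives one row at a time. The name row_sums is only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[4, 9], [1, 4], [5, 6]], columns=["A", "B"])

# Series.apply: the lambda sees one value of column A at a time; B is untouched.
df["A"] = df["A"].apply(lambda x: x + 1)

# DataFrame.apply with axis=1: the lambda sees one row (a Series) at a time.
row_sums = df.apply(lambda row: row["A"] + row["B"], axis=1)

print(df)
print(row_sums)
```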


In [4]:
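# This cell is about str.maketrans() and str.translate().
# In Python 3, maketrans() is a method of str, so the string module is not needed.

obj = 'python'
before = 'ython'
after = 'conda'
# maketrans() builds a character-by-character mapping: y->c, t->o, h->n, o->d, n->a.
sen = str.maketrans(before, after)
print(obj.translate(sen)) # translate() replaces each mapped character; 'p' is not mapped, so it stays.
# before and after must have the same length; otherwise maketrans() raises ValueError.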

pconda
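
Besides two equal-length strings, str.maketrans() also accepts a single dict, or a third argument listing characters to delete. A short sketch of both forms; the sample strings are made up for illustration.

```python
# Dict form: keys are single characters, values are their replacements.
leet_table = str.maketrans({"a": "4", "e": "3"})
leet = "lemmatize".translate(leet_table)

# Three-argument form: every character in the third string is deleted.
strip_table = str.maketrans("", "", "!?.")
stripped = "Publish or Perish!".translate(strip_table)

print(leet)
print(stripped)
```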


In [5]:
# This cell is about the re.sub() function.
# re.sub(pattern, repl, string) replaces every match of pattern in string with repl.
import re 

text = 'I like abple and abple'
text_mod = re.sub('abple','apple',text)

print(text_mod) # Every occurrence of 'abple' has been replaced with 'apple'.

I like apple and apple
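
re.sub() replaces every match by default. Two closely related options, sketched on the same sentence: the count argument limits the number of replacements, and re.subn() also reports how many replacements were made.

```python
import re

text = "I like abple and abple"

# count=1 replaces only the first match.
first_only = re.sub("abple", "apple", text, count=1)

# subn() returns a tuple: (new string, number of replacements).
fixed, n_replaced = re.subn("abple", "apple", text)

print(first_only)
print(fixed, n_replaced)
```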


In [6]:
# This cell is about collapsing repeated characters.
import re # Regular expression 
text = 'Publish or Periiish'
# (.) captures one character and \1+ matches one or more further copies of it;
# the whole run is then replaced by a single copy (\1).
text_sub = re.sub(r'(.)\1+',r'\1',text)
print(text_sub)

Publish or Perish
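
A backreference pattern that collapses every repeated character also shortens legitimate double letters. One possible refinement, sketched below as an assumption about what is wanted, is to collapse only runs of three or more identical characters. The second sample sentence is made up for illustration.

```python
import re

# Collapse any run of 2+ identical characters into one.
collapse_all = re.compile(r"(.)\1+")
# Collapse only runs of 3+ identical characters, so normal double letters survive.
collapse_long = re.compile(r"(.)\1{2,}")

text = "I looove apples"

print(collapse_all.sub(r"\1", text))
print(collapse_long.sub(r"\1", text))
print(collapse_long.sub(r"\1", "Publish or Periiish"))
```

The first call also turns 'apples' into 'aples', which is why the stricter pattern can be the safer default for ordinary English text.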


https://cosmosproject.tistory.com/180
The page above explains the syntax of the re.sub() function.
https://www.nextree.co.kr/p4327/
This page is a good reference for regular expressions.

In [7]:
import nltk
from nltk.tokenize import RegexpTokenizer
# Regular expressions are very useful for text processing,
# but it takes some time to become familiar with them.
text = 'Publish or Perish!'
# With gaps=True, the pattern would describe the separators between tokens instead of the tokens.
#tokenizer = RegexpTokenizer(r"\w+", gaps = True)
tokenizer = RegexpTokenizer(r'\w+')
# The first argument is the pattern that defines a token: \w+ keeps runs of word characters and drops punctuation.
print(tokenizer.tokenize(text))

['Publish', 'or', 'Perish']
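
To see what "pattern matches the tokens" versus "pattern matches the gaps" means without NLTK, here is a rough standard-library analogue using re.findall and re.split. It is only meant to illustrate the two modes; it is not NLTK's implementation.

```python
import re

text = "Publish or Perish!"

# Pattern describes the tokens themselves: keep every run of word characters.
tokens_by_match = re.findall(r"\w+", text)

# Pattern describes the gaps: split on whitespace and drop empty pieces.
tokens_by_gap = [piece for piece in re.split(r"\s+", text) if piece]

print(tokens_by_match)
print(tokens_by_gap)
```

Splitting on whitespace keeps the exclamation mark attached to the last word, while matching \w+ drops it.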


In [8]:
# In this cell, I compare stemming and lemmatizing.
import nltk
nltk.download('wordnet')
from nltk import WordNetLemmatizer
import os

# Machine-specific path to a local WordNet install.
wordnet_dir = r'C:\Program Files (x86)\WordNet\2.1'
os.chdir(wordnet_dir)

stemmer = nltk.PorterStemmer()
text = 'The greatest glory in living lies not in never falling, but in rising every time we fall.'
text = tokenizer.tokenize(text)  # reuses the RegexpTokenizer defined in the previous cell

# Stem every token with a list comprehension.
text_stem = [stemmer.stem(word) for word in text]
print(text_stem)
# The stems are not always real words (e.g. 'glori', 'everi').
# How about the lemmatizer?
# lemmatize() treats each word as a noun by default (pos='n'), so verb forms such as 'living' are left unchanged.
lemmatizer = WordNetLemmatizer()
text_lem = [lemmatizer.lemmatize(word) for word in text]
print(text_lem)

['the', 'greatest', 'glori', 'in', 'live', 'lie', 'not', 'in', 'never', 'fall', 'but', 'in', 'rise', 'everi', 'time', 'we', 'fall']


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\every\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['The', 'greatest', 'glory', 'in', 'living', 'lie', 'not', 'in', 'never', 'falling', 'but', 'in', 'rising', 'every', 'time', 'we', 'fall']
