# Refine the Data

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('last_news_final.csv')

In [3]:
df.head()

Unnamed: 0,title,date,url
0,Nur Otan-да екі аудандағы АӘК-тің таратылуында...,22:27 2019-08-17,https://kaz.zakon.kz/news/4981980-nur-otan-da-...
1,Асқар Мамин Арыс қаласын қалпына келтіру жұмыс...,18:16 2019-08-17,https://kaz.zakon.kz/news/4981971-as-ar-mamin-...
2,Түркістанда шығыстық үлгідегі автобекет ашылды,15:25 2019-08-17,https://kaz.zakon.kz/news/4981963-t-rk-standa-...
3,Бүгін арыстықтар 100 мың теңге жәрдемақыны алды,20:21 2019-08-16,https://kaz.zakon.kz/news/4981913-b-g-n-arysty...
4,АҚШ-тың санкциялары Ресейдің мұнайына деген сұ...,17:54 2019-08-16,https://kaz.zakon.kz/news/4981898-a-sh-ty-sank...


To convert a relative timestamp such as `6 hours ago` or `7 days ago` into a number of days, we will use the following rule:
- If the string contains **hours**, we count it as **1 day**
- If the string contains **day**, we take the number preceding **day**

To apply this rule, we need to pick these words and digits out of a string. For that we will use regular expressions.

## Introduction to Regular Expressions (Regex)

A regular expression is a pattern, written with special symbols, that describes which text to select from a string.

The following sites offer an interactive playground for trying patterns out:
- [http://regexr.com](http://regexr.com/)
- [http://regex101.com/](http://regex101.com/)

In [4]:
import re

In [5]:
test_string = "Hello world, welcome to 2016."

In [6]:
# re.search returns the first occurrence of the pattern within the string.
# The pattern here is plain text, so it matches that exact text.
a = re.search('Hello world, welcome to 2016',test_string)

In [7]:
a

<re.Match object; span=(0, 28), match='Hello world, welcome to 2016'>

In [8]:
a.group()

'Hello world, welcome to 2016'

In [40]:
# '.' matches any single character, so this returns the first character of the string
a = re.search('.',test_string)
a.group()

'H'

In [9]:
# '.*' matches any run of characters (except a newline), so this returns the whole string
a = re.search('.*',test_string)
a.group()

'Hello world, welcome to 2016.'

In [10]:
a = re.search('Hello',test_string)
print(a)

<re.Match object; span=(0, 5), match='Hello'>
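
One detail matters for the rest of this notebook: `re.search` returns `None` when the pattern is not found, so calling `.group()` on the result without checking it first raises an error. The sketch below reuses `test_string`; the word `Goodbye` is only an illustrative pattern that does not occur in the string.

```python
import re

test_string = "Hello world, welcome to 2016."

found = re.search('Hello', test_string)      # pattern is present
missing = re.search('Goodbye', test_string)  # pattern is absent (illustrative)

# Check for None before using the match object
for label, result in [('Hello', found), ('Goodbye', missing)]:
    if result is not None:
        print(label, '->', result.group(), result.span())
    else:
        print(label, '-> no match')
```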


**Some basic symbols**

**`?`**

The question mark indicates zero or one occurrence of the preceding element. For example, `colou?r` matches both "color" and "colour".

**`*`**

The asterisk indicates zero or more occurrences of the preceding element. For example, `ab*c` matches "ac", "abc", "abbc", "abbbc", and so on.

**`+`**

The plus sign indicates one or more occurrences of the preceding element. For example, `ab+c` matches "abc", "abbc", "abbbc", and so on, but not "ac".
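
The three quantifiers can be checked directly in Python. This is a small sketch: `matches` is a hypothetical helper written for this example, and it uses `re.fullmatch` so that a word only counts when the whole word fits the pattern.

```python
import re

def matches(pattern, words):
    """Return the words that the pattern matches in full (hypothetical helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

# '?': the 'u' may appear zero times or once
optional_u = matches(r'colou?r', ['color', 'colour', 'colouur'])

# '*': the 'b' may appear zero or more times
zero_or_more = matches(r'ab*c', ['ac', 'abc', 'abbbc'])

# '+': the 'b' must appear at least once
one_or_more = matches(r'ab+c', ['ac', 'abc', 'abbbc'])

print(optional_u)
print(zero_or_more)
print(one_or_more)
```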


In [43]:
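# \w matches one word character (letter, digit or underscore); '.' then matches any one character.
# The r'' prefix keeps the backslash from being treated as a string escape.
a = re.search(r'\w.',test_string)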
print(a)

<_sre.SRE_Match object; span=(0, 2), match='He'>


In [44]:
# \w* matches zero or more word characters, so the match stops at the first space
a = re.search(r'\w*',test_string)
print(a)

<_sre.SRE_Match object; span=(0, 5), match='Hello'>


### Exercises

In [45]:
string = '''In 2016, we are learning Text Analytics in Data Science 101
            by scraping http://datatau.com'''

In [46]:
string = "In 2016, we are learning Text Analytics in Data Science 101 by scraping http://datatau.com"

Write a regex to pick the number 2016 from the string above.

Write a regex to pick the URL (of the form http://xyz.com) from the string above.

## Let's get the date from our string

In [11]:
df.head()

Unnamed: 0,title,date,url
0,Nur Otan-да екі аудандағы АӘК-тің таратылуында...,22:27 2019-08-17,https://kaz.zakon.kz/news/4981980-nur-otan-da-...
1,Асқар Мамин Арыс қаласын қалпына келтіру жұмыс...,18:16 2019-08-17,https://kaz.zakon.kz/news/4981971-as-ar-mamin-...
2,Түркістанда шығыстық үлгідегі автобекет ашылды,15:25 2019-08-17,https://kaz.zakon.kz/news/4981963-t-rk-standa-...
3,Бүгін арыстықтар 100 мың теңге жәрдемақыны алды,20:21 2019-08-16,https://kaz.zakon.kz/news/4981913-b-g-n-arysty...
4,АҚШ-тың санкциялары Ресейдің мұнайына деген сұ...,17:54 2019-08-16,https://kaz.zakon.kz/news/4981898-a-sh-ty-sank...


In [12]:
df.tail()

Unnamed: 0,title,date,url
130,Тоқаев Сооронбай Жээнбековпен телефон арқылы с...,10:03 2019-08-09,https://kaz.zakon.kz/news/4980848-to-aev-sooro...
131,Дайын отыр! Атамбаевты неге жығып бермеу керек,09:50 2019-08-09,https://kaz.zakon.kz/news/4980845-dayyn-otyr-a...
132,Диана кайф. Жалаңаштанған қазақ қызы жұрттың ж...,09:31 2019-08-09,https://kaz.zakon.kz/news/4980839-diana-kayf-z...
133,Алматыда Toyota адам қағып кетті,09:18 2019-08-09,https://kaz.zakon.kz/news/4980835-almatyda-toy...
134,Атамбаев өз еркімен берілді,08:59 2019-08-09,https://kaz.zakon.kz/news/4980831-atambaev-z-e...


In [13]:
date_string = df['date'][0]

In [14]:
print(date_string)

22:27 2019-08-17


In [15]:
re.search('hours',date_string)

In [16]:
date_string = df['date'][50]

In [53]:
print(date_string)

4 points by lefish 7 days ago  | discuss


In [54]:
# 'hours' is not in this string, so re.search returns None and the cell shows no output
re.search('hours',date_string)

In [55]:
# Match the digits that precede the word 'day'
day_search = re.search(r'\d+ day',date_string)
day_search

<_sre.SRE_Match object; span=(19, 24), match='7 day'>

In [56]:
days_string = day_search.group(0)
days_string

'7 day'

In [57]:
days = days_string.split(' ')[0] 
days

'7'
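
Note that `split` gives back the string `'7'`, not a number. An alternative sketch, not the approach used in the cells above, is to put the digits in a capture group and convert them with `int`. The sample string is copied from the output shown earlier.

```python
import re

# Sample string copied from the output above
date_string = '4 points by lefish 7 days ago  | discuss'

# The parentheses capture only the digits; group(1) returns that captured part
day_search = re.search(r'(\d+) day', date_string)

# Guard against None in case the string has no 'N day' text
days = int(day_search.group(1)) if day_search is not None else None
print(days)
```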

In [58]:
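def return_reg_ex_days(row):
    """Return the age of a row in days, parsed from the text in row['date']."""
    # Anything measured in hours counts as 1 day ('hours?' also matches '1 hour ago')
    if re.search(r'hours?', row['date']) is not None:
        return 1
    # Otherwise take the digits in front of 'day', e.g. '7 days ago' -> 7
    day_search = re.search(r'(\d+) day', row['date'])
    if day_search is None:
        # Neither 'hours' nor 'N day' was found, so leave the value missing
        return None
    return int(day_search.group(1))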
        

In [59]:
# Now we apply this function to each row of the dataframe
df['days'] = df.apply(return_reg_ex_days,axis=1)
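
For larger dataframes, the same rule can also be written with pandas' vectorized string methods instead of a row-by-row `apply`. The following is only a sketch under stated assumptions: `sample` is a small hypothetical dataframe whose rows are copied from outputs shown in this notebook, and the result is stored as floats so that rows with no match can stay as `NaN`.

```python
import pandas as pd

# Hypothetical sample; the strings are copied from outputs in this notebook
sample = pd.DataFrame({'date': [
    '5 points by Rogerh91 6 hours ago | discuss',
    '4 points by lefish 7 days ago  | discuss',
    '3 points by nickhould 35 days ago | discuss',
]})

# True where the text mentions hours
is_hours = sample['date'].str.contains('hours', regex=False)

# Digits in front of 'day'; NaN where there is no such text
day_counts = pd.to_numeric(sample['date'].str.extract(r'(\d+) day', expand=False))

# Keep the extracted number unless the row is measured in hours, then use 1
sample['days'] = day_counts.where(~is_hours, 1)
print(sample)
```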

In [60]:
df.head()

Unnamed: 0,title,date,days
0,"An Exploration of R, Yelp, and the Search for ...",5 points by Rogerh91 6 hours ago | discuss,1
1,Deep Advances in Generative Modeling,7 points by gwulfs 15 hours ago | 1 comment,1
2,Spark Pipelines: Elegant Yet Powerful,3 points by aouyang1 9 hours ago | discuss,1
3,Shit VCs Say,3 points by Argentum01 10 hours ago | discuss,1
4,"Python, Machine Learning, and Language Wars",4 points by pmigdal 17 hours ago | discuss,1


In [61]:
df.tail()

Unnamed: 0,title,date,days
175,Getting Started with Statistics for Data Science,3 points by nickhould 35 days ago | discuss,35
176,Rodeo 1.3 - Tab-completion for docstrings,3 points by glamp 35 days ago | discuss,35
177,Teaching D3.js - links,3 points by pmigdal 35 days ago | discuss,35
178,Parallel scikit-learn on YARN,5 points by stijntonk 39 days ago | discuss,39
179,Meetup: Free Live Webinar on Prescriptive Anal...,2 points by ann928 32 days ago | discuss,32


In [62]:
# Save the dataframe to a CSV file
df.to_csv('data_tau_days.csv', index=False)