## Data Crawling Project

### 1. Motivation for Data Collection

<img src="./picture/word_count.png">

![0](./picture/mask1.png)

### 2. Crawling Method - Scrapy

#### 1. Creating the Project

The settings path used in step 4 (`gmarket_living/gmarket_living/settings.py`) matches the layout generated by `scrapy startproject gmarket_living`.


#### 2. Defining Items
#### gmarket
```Python
import scrapy


class GmarketItem(scrapy.Item):
    title = scrapy.Field()    # product name
    s_price = scrapy.Field()  # sale price
    link = scrapy.Field()     # product page URL
```
#### naver
```python
import scrapy


class NaverNewsItem(scrapy.Item):
    title = scrapy.Field()  # article title
    link = scrapy.Field()   # article URL
```

#### 3. Writing the Spiders
#### gmarket
```Python
import scrapy

# Assumes the default project layout, where items.py sits next to pipelines.py.
from gmarket_living.items import GmarketItem


class GmarketSpider(scrapy.Spider):
    name = "gmarket"
    # Scrapy reads allowed_domains, and it expects bare domain names rather than URLs.
    allowed_domains = ["gmarket.co.kr"]
    # Start from the Living category of the bestseller list.
    start_urls = ["http://corners.gmarket.co.kr/Bestsellers?viewType=G&groupCode=G08"]

    def parse(self, response):
        items = response.xpath('//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li')
        for item in items:
            # extract() returns a list; take the first href as a plain string.
            link = item.xpath("./a/@href").extract()[0]
            yield scrapy.Request(link, callback=self.parse_page_contents)

    def parse_page_contents(self, response):
        item = GmarketItem()
        item["title"] = response.xpath('//*[@id="itemcase_basic"]/h1/text()').extract()[0].strip()
        s_price = response.xpath('//*[@id="itemcase_basic"]/p/span/strong/text()').extract()
        # Strip the thousands separators so the unit price can be computed later.
        item["s_price"] = s_price[0].replace(",", "")
        item["link"] = response.url
        yield item
```

#### naver

```python
import scrapy

# NaverNewsItem is the item class defined in this project's items.py.


class Spider(scrapy.Spider):
    name = "NaverNews"
    allowed_domains = ["naver.com"]  # the attribute Scrapy reads is allowed_domains
    start_urls = ["https://news.naver.com/main/home.nhn"]

    def parse(self, response):
        link_p = response.xpath('//*[@id="section_politics"]/div[2]/div/ul/li/a/@href').extract()  # politics
        link_e = response.xpath('//*[@id="section_economy"]/div[2]/div/ul/li/a/@href').extract()  # economy
        link_s = response.xpath('//*[@id="section_society"]/div[2]/div/ul/li/a/@href').extract()  # society
        urls = response.xpath('//*[@id="today_main_news"]/div[2]/ul/li/div[1]/a/@href').extract()  # headlines

        # Resolve the headline hrefs against the page URL to get absolute links.
        link_h = [response.urljoin(url) for url in urls]

        links = link_p + link_e + link_s + link_h  # combine all links

        for link in links:
            yield scrapy.Request(link, callback=self.page_content)

    def page_content(self, response):
        item = NaverNewsItem()
        item["title"] = response.xpath('//*[@id="articleTitle"]/text()')[0].extract()
        item["link"] = response.url

        yield item
```
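
The Naver spider passes the headline hrefs through `response.urljoin`, which resolves an href against the URL of the page being parsed. Much the same resolution can be tried outside Scrapy with the standard library's `urljoin`. This is only a sketch: the relative path and the `example.com` URL below are made up for illustration and are not real article links.

```python
from urllib.parse import urljoin

# Page the spider starts from (taken from start_urls in the spider).
base = "https://news.naver.com/main/home.nhn"

# Hypothetical relative href, for illustration only.
relative = "/main/read.nhn?oid=000&aid=0000000000"

# A relative href is resolved against the scheme and host of the base URL.
absolute = urljoin(base, relative)
print(absolute)

# An href that is already absolute is returned unchanged.
already_absolute = urljoin(base, "https://example.com/article")
print(already_absolute)
```

Because absolute hrefs pass through unchanged, joining every extracted link would also be safe for the politics, economy, and society sections.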

#### 4. Configuring Pipelines

#### gmarket
1. Send the product name, unit price, sale price, and link to Slack

```python
import json
import re

import requests


class CrawlerPipeline(object):
    
    def __send_slack(self, msg):
        WEBHOOK_URL = "https://hooks.slack.com/services/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"
        payload = {
            "channel" : "#dk3059",
            "username" : "KDK",
            "icon_emoji" : ":crn:",
            "text" : msg,
        }
        requests.post(WEBHOOK_URL, json.dumps(payload))
    
    
    def process_item(self, item, spider):
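        keyword = "마스크"  # Korean for "mask"

        if keyword in item["title"]:
            try:
                # Extract the pack size (a 2-3 digit number) from the title with a regex.
                num = re.findall(r'\s([0-9]{2,3})\S.', item["title"])[0]
            except IndexError:
                # No pack size found: treat the listing as a single unit.
                num = 1

            num_p = round(int(item['s_price']) / int(num))  # price per unit

            if num_p < 4000:  # notify Slack only when the unit price is under 4,000 won
                print(item['title'])
                # Bold the site name and render the unit price as inline code.
                self.__send_slack("*{}*, {}, `{}`, {}, {}".format(
                    "gmarket", item['title'], num_p, item['s_price'], item['link']))
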
        return item
```
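
The unit-price step in `process_item` can be tried on its own without Scrapy or Slack. The sketch below wraps the same regular expression in a helper named `unit_price`, which is a hypothetical name and not part of the project; the titles and prices are invented for illustration.

```python
import re


def unit_price(title, s_price):
    """Return the price per unit; titles without a pack size count as one unit."""
    found = re.findall(r'\s([0-9]{2,3})\S.', title)
    num = int(found[0]) if found else 1
    return round(int(s_price) / num)


# Hypothetical titles and prices, for illustration only.
packed = unit_price("KF94 마스크 50매입 대형", "25000")  # pack of 50
single = unit_price("면 마스크", "3500")                  # no pack size in the title
edge = unit_price("마스크 100매", "10000")                # title ends right after the unit

print(packed, single, edge)
```

The last case shows a quirk of the trailing `\S.`: the pattern needs two characters after the digits, so a title ending in ` 100매` is read as a pack of 10 rather than 100. The `\S+` variant used later in the DataFrame preprocessing does not have this behavior.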


2. Save the data to MongoDB

```python
class CrawlerPipeline2(object):
    def process_item(self, item, spider):
        
        data = {"title" : item["title"],
                "s_price" : item["s_price"],
                "link" : item["link"]}
        
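        # `collection` is a pymongo collection assumed to be created at module level.
        # insert_one replaces Collection.insert, which was removed in PyMongo 4.
        collection.insert_one(data)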
        
        return item
```
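
The pipeline above relies on a `collection` object defined elsewhere. One possible way to make it self-contained is to open the connection in Scrapy's `open_spider` hook and close it in `close_spider`. This is a sketch under stated assumptions, not the project's actual code: the class name `MongoPipeline`, the URI, and the database and collection names are all hypothetical placeholders.

```python
class MongoPipeline:
    """Sketch of a pipeline that owns its MongoDB connection."""

    mongo_uri = "mongodb://localhost:27017/"  # assumption: replace with the real server
    db_name = "gmarket"                       # hypothetical database name
    collection_name = "living"                # hypothetical collection name

    def open_spider(self, spider):
        # Imported here so the class can be loaded even where pymongo is absent.
        import pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # dict(item) works for Scrapy items as well as plain dicts.
        self.collection.insert_one(dict(item))
        return item
```

Scrapy calls `open_spider` once when the crawl starts and `close_spider` once when it ends, so a single connection is reused for every item.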


#### naver
1. Send the article title and URL to Slack

```python
import json

import requests


class CrawlerPipeline(object):
    
    def __send_slack(self, msg):
        WEBHOOK_URL = "https://hooks.slack.com/services/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"
        payload = {
            "channel" : "#dk3059",
            "username" : "KDK",
            "text" : msg,
        }
        requests.post(WEBHOOK_URL, json.dumps(payload))

        
    
    def process_item(self, item, spider):
        keyword = "코로나"   # Korean for "corona"
        keyword2 = "마스크"  # Korean for "mask"
        if keyword in item["title"] or keyword2 in item["title"]:
            # Bold only the titles that mention either keyword.
            self.__send_slack("*{}*, {}".format(item['title'], item['link']))
        else:
            self.__send_slack("{}, {}".format(item['title'], item['link']))
        return item
```
2. Save to MongoDB

```python
class CrawlerPipeline2(object):
    def process_item(self, item, spider):
        
        data = {"title" : item["title"],
                "link" : item["link"]}
        
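        # `collection` is a pymongo collection assumed to be created at module level.
        # insert_one replaces Collection.insert, which was removed in PyMongo 4.
        collection.insert_one(data)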
        return item
```

#### gmarket

```python
!echo "ITEM_PIPELINES = {" >> gmarket_living/gmarket_living/settings.py
!echo "    'gmarket_living.pipelines.CrawlerPipeline' : 300," >> gmarket_living/gmarket_living/settings.py
!echo "    'gmarket_living.pipelines.CrawlerPipeline2' : 301," >> gmarket_living/gmarket_living/settings.py
!echo "}" >> gmarket_living/gmarket_living/settings.py
```
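
After these four `echo` commands run, the end of `settings.py` should contain a dictionary like the one below. Scrapy runs pipelines in ascending order of the integer value, so `CrawlerPipeline` (300) handles each item before `CrawlerPipeline2` (301).

```python
# Expected tail of gmarket_living/gmarket_living/settings.py after the echo commands.
ITEM_PIPELINES = {
    'gmarket_living.pipelines.CrawlerPipeline': 300,   # Slack notification first
    'gmarket_living.pipelines.CrawlerPipeline2': 301,  # then the MongoDB save
}
```

Since the commands append with `>>`, running the cell twice leaves two `ITEM_PIPELINES` definitions in the file; the later one simply overrides the earlier one.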

#### 5. Running Scrapy

1. Slack
<img src="./picture/slack.png">
<img src="./picture/slack_n.png">

2. MongoDB

<img src="./picture/robo_n.png">

### coupang

```python
%%writefile coupang.py
from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.common.by import By
import re
import pymongo


class coupang():

    def __init__(self, keyword):
        self.data = []
        self.get_links(keyword)
        self.mongo()

    def get_links(self, keyword):
        datas = []
        url = "https://www.coupang.com/np/search?q={}&channel=recent".format(keyword)
        # Run Chrome inside a virtual display so it works on a server without a screen.
        display = Display(visible=0, size=(800, 600))
        display.start()
        driver = webdriver.Chrome()
        driver.get(url)
        # find_elements(By.XPATH, ...) replaces find_elements_by_xpath, removed in Selenium 4.
        elements = driver.find_elements(By.XPATH, '//*[@id="productList"]/li/a')
        links = [element.get_attribute("href") for element in elements]
        driver.quit()
        display.stop()
        for link in links:
            display = Display(visible=0, size=(800, 600))
            display.start()
            driver = webdriver.Chrome()
            driver.get(link)
            elements = driver.find_elements(By.XPATH, '//*[@id="contents"]/div[1]/div/div[3]/div[3]/h2')
            title = [element.text for element in elements][0]
            elements1 = driver.find_elements(By.XPATH, '//*[@id="contents"]/div[1]/div/div[3]/div[5]/div[1]/div/div[2]/span[1]/strong')
            try:
                # Drop the thousands separators and the trailing currency character.
                s_price = [element.text.replace(",", "")[:-1] for element in elements1][0]
            except IndexError:
                s_price = 0  # sold out: the price element is missing
            elements2 = driver.find_elements(By.XPATH, '//*[@id="contents"]/div[1]/div/div[3]/div[6]/div[2]/div/div[1]/div[1]/span')
            d_price = [element.text.replace(",", "") for element in elements2][0]
            fees = re.findall(r'\s([0-9]{4,6})\S+', d_price)
            d_price = fees[0] if fees else 0  # keep only the delivery fee; 0 if none is listed
            price = int(s_price) + int(d_price)  # sale price plus delivery fee
            datas.append({"title": title, "price": price, "link": link})
            driver.quit()
            display.stop()
        self.data = datas

    def mongo(self):  # save the results to MongoDB
        client = pymongo.MongoClient('mongodb://15.165.218.144:27017/')
        db = client.coupang
        collection = db.mask
        if self.data:
            # insert_many replaces Collection.insert, which was removed in PyMongo 4.
            collection.insert_many(self.data)


coupang("마스크")
```
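
The price arithmetic inside `get_links` can be checked without launching a browser. The sketch below applies the same string cleanup and regular expression to plain text; the helper name `total_price` and the sample page texts are hypothetical and chosen only for illustration.

```python
import re


def total_price(price_text, delivery_text):
    """Add the sale price and the delivery fee the way get_links does."""
    # "12,900원" -> "12900": drop separators, then the trailing currency character.
    s_price = price_text.replace(",", "")[:-1]
    # Only a 4-6 digit number that follows whitespace counts as a delivery fee.
    fees = re.findall(r'\s([0-9]{4,6})\S+', delivery_text.replace(",", ""))
    d_price = fees[0] if fees else 0
    return int(s_price) + int(d_price)


# Hypothetical page texts, for illustration only.
with_fee = total_price("12,900원", "배송비 2,500원")
free = total_price("12,900원", "무료배송")

print(with_fee, free)
```

Because the pattern requires at least four digits, a delivery fee below 1,000 won would be treated as free shipping.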

<img src="./delivery.png">

<img src="./picture/pd_coupang.png">

#### Verifying the MongoDB Save

<img src="./picture/robo3_coupang.png">

### 3. Processing the DataFrame

#### 1. Loading the DataFrame
```python
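import re

import pandas as pd
import pymongo

client = pymongo.MongoClient('mongodb://15.165.218.144:27017/')
db = client.coupang
collection = db.mask
items = collection.find({})
df = pd.DataFrame(list(items))  # materialize the cursor before building the frame
df = df[['title', 'price']]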
```

#### 2. Preprocessing
1. Extract the unit count from the title

```python
# Use the first 2-3 digit number that follows whitespace as the pack size; default to 1.
df['unit'] = df['title'].apply(lambda data : int(re.findall(r'\s([0-9]{2,3})\S+', data)[0]) if re.findall(r'\s([0-9]{2,3})\S+', data) else 1)
```

2. Compute the per-unit price

```python
def bin_price(x):
    # Label each unit price with the lower bound of its bucket.
    if not x > 0:
        return x                   # leave 0 (and missing values) unchanged
    if x < 1000:
        return 500                 # everything under 1,000 won
    if x < 10000:
        return x // 1000 * 1000    # 1,000-won buckets
    if x < 30000:
        return x // 10000 * 10000  # 10,000-won buckets
    return 30000                   # 30,000 won and above

df['u_price'] = round(df['price'] / df['unit'])
df['n_price'] = df['u_price'].apply(bin_price)
```

3. Flag whether the mask is KF-rated

```python
# Match the literal prefixes KF or KN; a class like [K,F,N]{2} would also match "FN" or ",K".
df['KF'] = df['title'].apply(lambda data : 1 if re.search(r'K[FN]', data) else 0)
```
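
A bracket expression such as `[K,F,N]{2}` is a character class, so it accepts any two of `K`, `,`, `F`, `N` in a row, not just the prefixes `KF` and `KN`. The sketch below contrasts it with the stricter `K[FN]` on a few titles; the titles are hypothetical and chosen only to show the difference.

```python
import re

loose = re.compile(r'([K,F,N]{2})')  # any two characters drawn from K , F N
strict = re.compile(r'K[FN]')        # only the literal prefixes KF or KN

# Hypothetical titles, for illustration only.
titles = ["KF94 마스크 50매", "KN95 마스크", "FNC 일회용 마스크", "일회용 마스크"]

loose_flags = [bool(loose.search(title)) for title in titles]
strict_flags = [bool(strict.search(title)) for title in titles]

for title, l, s in zip(titles, loose_flags, strict_flags):
    print(title, l, s)
```

The third title is flagged only by the character-class pattern, because `FN` satisfies it even though the product is not a KF or KN mask.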

### DataFrame

<img src="./picture/df.png">

### Price Comparison: KF vs. Disposable Masks

### 1. Total
<img src="./picture/Total.png">

### 2. KF, KN
<img src="./picture/KF.png">

### 3. Disposable
<img src="./picture/n_KF.png">

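### 4. Project Retrospective
 - Going forward, crawl periodically to judge whether the five-day mask rotation system (마스크 5부제) and later government policies are succeeding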