### Netflix EDA
- Collect data from the services below and perform EDA.


- Collect data from the flixpatrol site
    - `https://flixpatrol.com/top10/netflix/world/2021/full/#netflix-1`


- Collect Netflix content data from kaggle
    - `https://www.kaggle.com/shivamb/netflix-shows`


- Use the code below for the content rating data
```
ratings_ages = {'TV-PG': 'Older Kids', 'TV-MA': 'Adults', 'TV-Y7-FV': 'Older Kids',
                  'TV-Y7': 'Older Kids', 'TV-14': 'Teens', 'R': 'Adults', 'TV-Y': 'Kids',
                  'NR': 'Adults', 'PG-13': 'Teens', 'TV-G': 'Kids', 'PG': 'Older Kids',
                  'G': 'Kids', 'UR': 'Adults', 'NC-17': 'Adults'}
```
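
A minimal sketch of how this mapping might be applied, assuming the data sits in a DataFrame with a `rating` column. The three-row `sample` frame and the `age_group` column name are made up for illustration.

```python
import pandas as pd

ratings_ages = {'TV-PG': 'Older Kids', 'TV-MA': 'Adults', 'TV-Y7-FV': 'Older Kids',
                'TV-Y7': 'Older Kids', 'TV-14': 'Teens', 'R': 'Adults', 'TV-Y': 'Kids',
                'NR': 'Adults', 'PG-13': 'Teens', 'TV-G': 'Kids', 'PG': 'Older Kids',
                'G': 'Kids', 'UR': 'Adults', 'NC-17': 'Adults'}

# Hypothetical sample; the real data would come from the Kaggle file.
sample = pd.DataFrame({"rating": ["TV-MA", "PG", None]})

# map() yields NaN for values that are absent from the dictionary,
# which makes missing or unmapped ratings easy to spot.
sample["age_group"] = sample["rating"].map(ratings_ages)
unmapped = sample[sample["age_group"].isna()]
print(sample)
```

Writing the result to a separate column keeps the original rating codes available for later checks.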

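#### EDA procedure
- Data collection
- Handling missing data
- Data exploration
    - Choose any topic from the collected data and derive insights through data analysis
    - Examples: Indian movies probably have longer running times than movies from other countries. Which country produces the highest-quality content?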
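
The missing-data step in the procedure above could be sketched as follows. The `df` frame and its values are hypothetical, and filling with `"Unknown"` is only one possible treatment.

```python
import pandas as pd

# Hypothetical sample standing in for the collected data.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "country": ["India", None, "South Korea"],
    "duration": ["150 min", "95 min", None],
})

# Count missing values per column before deciding how to treat them.
missing_counts = df.isna().sum()

# One possible treatment: label unknown categories, and drop rows that
# lack the value the analysis depends on.
cleaned = (
    df.assign(country=df["country"].fillna("Unknown"))
      .dropna(subset=["duration"])
)
print(missing_counts)
```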

### Collecting South Korea Top 10 movie data

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
# 1. Analyze the web service: URL

In [3]:
country = "south-korea"
year = "2021"
url = f"https://flixpatrol.com/top10/netflix/{country}/{year}/"
url

'https://flixpatrol.com/top10/netflix/south-korea/2021/'
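
Since only the country slug and the year vary in this URL, the pattern could be wrapped in a small helper. `BASE_URL` and `build_top10_url` are hypothetical names introduced for this sketch.

```python
BASE_URL = "https://flixpatrol.com/top10/netflix"


def build_top10_url(country, year):
    """Return the yearly Top 10 page URL for a country slug such as 'south-korea'."""
    return f"{BASE_URL}/{country}/{year}/"


# Build the URLs for the two country slugs used in this notebook.
urls = [build_top10_url(country, 2021) for country in ["south-korea", "united-states"]]
print(urls[0])  # https://flixpatrol.com/top10/netflix/south-korea/2021/
```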

In [4]:
# 2. request(url) > response(data) : data(html) *note: the HTML string covers the whole page

In [5]:
response = requests.get(url)
response

<Response [200]>
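
The request succeeded here, but a failed request would otherwise pass silently into the parsing step. A hedged sketch of a fetch helper that raises on HTTP errors is shown below. `fetch_html` and `FakeResponse` are hypothetical names, and the fake response exists only so the helper can be exercised without network access.

```python
def fetch_html(url, get=None, timeout=10):
    """Fetch a page and return its HTML, raising on HTTP error statuses."""
    if get is None:
        import requests  # imported lazily so the helper can be tested offline
        get = requests.get
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


class FakeResponse:
    """Stand-in for requests.Response, used only to run the helper offline."""
    text = "<html></html>"

    def raise_for_status(self):
        return None


html = fetch_html(
    "https://flixpatrol.com/top10/netflix/south-korea/2021/",
    get=lambda url, timeout: FakeResponse(),
)
```

In the notebook itself the call would simply be `fetch_html(url)`, which falls back to `requests.get`.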

In [6]:
response.text[:500]

'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n\t<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">\n\t<title>TOP 10 on Netflix in South Korea in 2021 • FlixPatrol</title>\n\t<link rel="preload" as="font" type="font/woff2" crossorigin href="/static/fonts/Inter-roman.var.woff2?v=3.18" nonce="h1scZX6pgK44s4WFbx6Qvg==">\n\t<link rel="stylesheet" type="text/css" href="/static/dist/all.min.css?v=ec3a8b3a" nonce'

In [7]:
# 3. data(html) > bs_obj.select(css-selector) > text 

In [8]:
dom = BeautifulSoup(response.text, "html.parser")

In [9]:
elements = dom.select("#netflix-1 > div.-mx-content > div > div > table > tbody > tr")
len(elements)

10

In [10]:
# Collect the required fields from each row
element = elements[0]
data = {
    # strip(): remove leading and trailing whitespace
    "no":element.select("td")[0].text.strip(),
    "title":element.select("td")[1].select_one("a").text.strip(),
    "points":element.select("td")[2].text.strip(),
}

data

{'no': '1.', 'title': '365 Days', 'points': '573'}

In [11]:
# Build a DataFrame from a list of row dicts: [{row1}, {row2}, ...]
data = []
for element in elements:
    data.append({
        "no":element.select("td")[0].text.strip(),
        "title":element.select("td")[1].select_one("a").text.strip(),
        "points":element.select("td")[2].text.strip(),
    })
df = pd.DataFrame(data)
df.tail(2)

| | no | title | points |
|---|---|---|---|
| 8 | 9.0 | Parasite | 241 |
| 9 | 10.0 | Samjin Company English Class | 238 |


In [12]:
# Wrap the steps in a function: collect Top 10 movie data by country

In [13]:
def topmovies(country, year):
    # 1. Analyze the web service: URL
    url = f"https://flixpatrol.com/top10/netflix/{country}/{year}/"
    # 2. request(url) > response(data) : data(html) *note: the HTML string covers the whole page
    response = requests.get(url)
    # 3. data(html) > bs_obj.select(css-selector) > text
    dom = BeautifulSoup(response.text, "html.parser")
    # "#netflix-1" is the movie table on the page
    elements = dom.select("#netflix-1 > div.-mx-content > div > div > table > tbody > tr")
    # Build a DataFrame from a list of row dicts: [{row1}, {row2}, ...]
    data = []
    for element in elements:
        data.append({
            "no":element.select("td")[0].text.strip(),
            "title":element.select("td")[1].select_one("a").text.strip(),
            "points":element.select("td")[2].text.strip(),
        })
    df = pd.DataFrame(data)
    return df

In [14]:
country = "united-states"
year = "2022"
topmovies(country, year)

| | no | title | points |
|---|---|---|---|
| 0 | 1.0 | Don't Look Up | 265 |
| 1 | 2.0 | Despicable Me 2 | 173 |
| 2 | 3.0 | The Tinder Swindler | 165 |
| 3 | 4.0 | Just Go with It | 148 |
| 4 | 5.0 | The Royal Treatment | 117 |
| 5 | 6.0 | Journey 2: The Mysterious Island | 109 |
| 6 | 7.0 | Despicable Me | 107 |
| 7 | 8.0 | Brazen | 103 |
| 8 | 9.0 | Home Team | 99 |
| 9 | 10.0 | The Longest Yard | 97 |
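
Because the function takes the country as a parameter, tables for several countries could be stacked into one DataFrame. The sketch below assumes a `fetch` callable with the same signature as `topmovies`. `collect_top10` and `fake_fetch` are hypothetical names, and `fake_fetch` returns made-up rows so the example runs offline.

```python
import pandas as pd


def collect_top10(countries, year, fetch):
    """Stack per-country Top 10 tables, tagging each row with its country."""
    frames = []
    for country in countries:
        frame = fetch(country, year).copy()
        frame["country"] = country
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


def fake_fetch(country, year):
    """Offline stand-in for topmovies(); returns a fixed one-row table."""
    return pd.DataFrame([{"no": "1.", "title": f"Example ({country})", "points": "100"}])


combined = collect_top10(["south-korea", "united-states"], "2021", fake_fetch)
print(combined)
```

With network access, `collect_top10(["south-korea", "united-states"], "2021", topmovies)` would use the real scraper instead.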


### Collecting South Korea Top 10 TV show data

In [15]:
# 1. Analyze the web service: URL

In [16]:
url = "https://flixpatrol.com/top10/netflix/south-korea/2021/"

In [17]:
# 2. request(url) > response(data) : data(html) *note: the HTML string covers the whole page

In [18]:
response = requests.get(url)
response

<Response [200]>

In [19]:
response.text[:500]

'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n\t<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">\n\t<title>TOP 10 on Netflix in South Korea in 2021 • FlixPatrol</title>\n\t<link rel="preload" as="font" type="font/woff2" crossorigin href="/static/fonts/Inter-roman.var.woff2?v=3.18" nonce="ZoGOsgvdfiji7ixJEvd+/w==">\n\t<link rel="stylesheet" type="text/css" href="/static/dist/all.min.css?v=ec3a8b3a" nonce'

In [20]:
# 3. data(html) > bs_obj.select(css-selector) > text 

In [21]:
dom = BeautifulSoup(response.text, "html.parser")

In [22]:
elements = dom.select("#netflix-2 > div.-mx-content > div > div > table > tbody > tr")
len(elements)

10

In [23]:
# Collect the required fields from each row
element = elements[0]
data = {
    # strip(): remove leading and trailing whitespace
    "no":element.select("td")[0].text.strip(),
    "title":element.select("td")[1].select_one("a").text.strip(),
    "points":element.select("td")[2].text.strip(),
}

data

{'no': '1.', 'title': 'Hospital Playlist', 'points': '1,215'}
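
The scraped values are strings: the rank carries a trailing period and the points contain a thousands separator. A small sketch of converting them to integers follows. `parse_rank` and `parse_points` are hypothetical helper names.

```python
def parse_rank(text):
    """Convert a rank such as '1.' to the integer 1."""
    return int(text.strip().rstrip("."))


def parse_points(text):
    """Convert a points string such as '1,215' to the integer 1215."""
    return int(text.strip().replace(",", ""))


row = {"no": "1.", "title": "Hospital Playlist", "points": "1,215"}
cleaned = {
    "no": parse_rank(row["no"]),
    "title": row["title"],
    "points": parse_points(row["points"]),
}
print(cleaned)  # {'no': 1, 'title': 'Hospital Playlist', 'points': 1215}
```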

In [24]:
# Build a DataFrame from a list of row dicts: [{row1}, {row2}, ...]
data = []
for element in elements:
    data.append({
        "no":element.select("td")[0].text.strip(),
        "title":element.select("td")[1].select_one("a").text.strip(),
        "points":element.select("td")[2].text.strip(),
    })
df = pd.DataFrame(data)
df.tail(2)

| | no | title | points |
|---|---|---|---|
| 8 | 9.0 | Sisyphus: The Myth | 542 |
| 9 | 10.0 | The King's Affection | 540 |


In [25]:
def topshows(country, year):
    # 1. Analyze the web service: URL
    url = f"https://flixpatrol.com/top10/netflix/{country}/{year}/"
    # 2. request(url) > response(data) : data(html) *note: the HTML string covers the whole page
    response = requests.get(url)
    # 3. data(html) > bs_obj.select(css-selector) > text
    dom = BeautifulSoup(response.text, "html.parser")
    # "#netflix-2" is the TV show table on the page
    elements = dom.select("#netflix-2 > div.-mx-content > div > div > table > tbody > tr")
    # Build a DataFrame from a list of row dicts: [{row1}, {row2}, ...]
    data = []
    for element in elements:
        data.append({
            "no":element.select("td")[0].text.strip(),
            "title":element.select("td")[1].select_one("a").text.strip(),
            "points":element.select("td")[2].text.strip(),
        })

    return pd.DataFrame(data)

In [26]:
country = "united-states"
year = "2022"
topshows(country, year)

| | no | title | points |
|---|---|---|---|
| 0 | 1.0 | Ozark | 262 |
| 1 | 2.0 | Cobra Kai | 230 |
| 2 | 3.0 | All of Us are Dead | 180 |
| 3 | 4.0 | Stay Close | 172 |
| 4 | 5.0 | The Witcher | 171 |
| 5 | 6.0 | Archive 81 | 167 |
| 6 | 7.0 | Sweet Magnolias | 152 |
| 7 | 8.0 | CoComelon | 151 |
| 8 | 9.0 | Cheer | 133 |
| 9 | 10.0 | The Woman in the House Across the Street from ... | 128 |
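
The movie and TV show functions differ only in the section id (`#netflix-1` versus `#netflix-2`), so the parsing step could be shared. The sketch below is an assumption-laden illustration: `parse_top10` is a hypothetical name, the selector is shortened to `#<id> table tbody tr`, and `sample_html` is made-up markup that only mimics the table layout.

```python
from bs4 import BeautifulSoup


def parse_top10(html, section_id):
    """Parse the table rows under the element with the given id into row dicts."""
    dom = BeautifulSoup(html, "html.parser")
    data = []
    for row in dom.select(f"#{section_id} table tbody tr"):
        cells = row.select("td")
        link = cells[1].select_one("a")
        # Fall back to the cell itself if the title is not wrapped in a link.
        title_node = link if link is not None else cells[1]
        data.append({
            "no": cells[0].text.strip(),
            "title": title_node.text.strip(),
            "points": cells[2].text.strip(),
        })
    return data


# Simplified, made-up HTML; the real page nests the tables more deeply.
sample_html = """
<div id="netflix-1"><table><tbody>
<tr><td>1.</td><td><a href="/title/example-movie/">Example Movie</a></td><td>573</td></tr>
</tbody></table></div>
<div id="netflix-2"><table><tbody>
<tr><td>1.</td><td><a href="/title/example-show/">Example Show</a></td><td>1,215</td></tr>
</tbody></table></div>
"""
movies = parse_top10(sample_html, "netflix-1")
shows = parse_top10(sample_html, "netflix-2")
```

Separating parsing from fetching also makes it possible to test the selector logic against saved HTML.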


### Total Movies on Netflix in 2021

In [42]:
def collect_totalmovies(year):
    # 1. Analyze the web service: URL
    url = f"https://flixpatrol.com/top10/netflix/world/{year}/full/"
    # 2. request(url) > response(data) : data(html) *note: the HTML string covers the whole page
    response = requests.get(url)
    # 3. data(html) > bs_obj.select(css-selector) > text
    dom = BeautifulSoup(response.text, "html.parser")
    elements = dom.select("#netflix-1 > div.-mx-content > div > div > table > tbody > tr")
    # Build a DataFrame from a list of row dicts: [{row1}, {row2}, ...]
    data = []
    for element in elements:
        data.append({
            "title" : element.select("td")[2].text.strip(),
            "point" : element.select("td")[3].text.strip(),
            "countries" : element.select("td")[5].text.strip(),
            "point/countries" : element.select("td")[6].text.strip(),
            "days" : element.select("td")[7].text.strip(),
            "point/days" : element.select("td")[8].text.strip(),
        })
    return pd.DataFrame(data)

In [43]:
year = "2021"
# The function has its own name so that this assignment does not overwrite it
totalmovies = collect_totalmovies(year)


### Total TV shows on Netflix in 2021

In [29]:
def totalshows(year):
    # 1. Analyze the web service: URL
    url = f"https://flixpatrol.com/top10/netflix/world/{year}/full/"
    # 2. request(url) > response(data) : data(html) *note: the HTML string covers the whole page
    response = requests.get(url)
    # 3. data(html) > bs_obj.select(css-selector) > text
    dom = BeautifulSoup(response.text, "html.parser")
    elements = dom.select("#netflix-2 > div.-mx-content > div > div > table > tbody > tr")
    # Build a DataFrame from a list of row dicts: [{row1}, {row2}, ...]
    data = []
    for element in elements:
        data.append({
            "title" : element.select("td")[2].text.strip(),
            "point" : element.select("td")[3].text.strip(),
            "countries" : element.select("td")[5].text.strip(),
            "point/countries" : element.select("td")[6].text.strip(),
            "days" : element.select("td")[7].text.strip(),
            "point/days" : element.select("td")[8].text.strip(),
        })
    return pd.DataFrame(data)

In [47]:
year = "2021"
totalshows(year)

TypeError: 'DataFrame' object is not callable
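
This TypeError is what Python raises when a name that used to refer to a function now refers to a DataFrame, which may have happened here through an assignment of the form `name = name(year)` in an earlier run. The sketch below reproduces the effect with a plain list. `totals` and `totals_fn` are hypothetical names.

```python
def totals(year):
    """Hypothetical stand-in for a collection function."""
    return [year]


totals = totals("2021")  # the name now refers to the returned list

try:
    totals("2021")  # so calling it again fails
except TypeError as error:
    message = str(error)  # "'list' object is not callable"


# Binding the result to a different name avoids the problem.
def totals_fn(year):
    return [year]


result = totals_fn("2021")
print(message)
```

In a notebook, re-running the cell that defines the function restores the name until it is overwritten again.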

### flixpatrol + kaggle merge

In [31]:
netflix = pd.read_csv("netflix_titles.csv")
netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [32]:
new = pd.merge(totalmovies, netflix, on = "title")
new

Unnamed: 0,title,point,countries,point/countries,days,point/days,show_id,type,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,We Can Be Heroes,25311,82,309,214,118,s1495,Movie,Robert Rodriguez,"YaYa Gosselin, Pedro Pascal, Priyanka Chopra, ...",United States,"December 25, 2020",2020,PG,101 min,"Children & Family Movies, Comedies",When alien invaders capture Earth’s superheroe...
1,Army of the Dead,18888,89,212,92,205,s854,Movie,Zack Snyder,"Dave Bautista, Ella Purnell, Omari Hardwick, G...",United States,"May 21, 2021",2021,R,148 min,"Action & Adventure, Horror Movies","After a zombie outbreak in Las Vegas, a group ..."
2,Wish Dragon,16953,82,207,88,193,s740,Movie,Chris Appelhans,"Jimmy Wong, John Cho, Constance Wu, Will Yun L...","China, United States, Canada","June 11, 2021",2021,PG,102 min,"Children & Family Movies, Comedies",Determined teen Din is longing to reconnect wi...
3,The Mitchells vs. The Machines,14902,82,182,74,201,s962,Movie,"Mike Rianda, Jeff Rowe","Danny McBride, Abbi Jacobson, Maya Rudolph, Mi...",,"April 30, 2021",2021,PG,114 min,"Children & Family Movies, Comedies",A robot apocalypse put the brakes on their cro...
4,Fatherhood,14660,82,179,54,271,s686,Movie,Paul Weitz,"Kevin Hart, Alfre Woodard, Lil Rel Howery, DeW...",United States,"June 18, 2021",2021,PG-13,111 min,Dramas,"A widowed new dad copes with doubts, fears, he..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84,2012,1886,31,61,51,37,s1114,Movie,Roland Emmerich,"John Cusack, Amanda Peet, Chiwetel Ejiofor, Th...",United States,"April 1, 2021",2009,PG-13,158 min,"Action & Adventure, Sci-Fi & Fantasy",When a flood of natural disasters begins to de...
85,Wanted,1878,32,59,20,94,s633,Movie,Nibal Arakji,"Daad Rizk, Georges Diab, Sihame Haddad, George...",,"June 28, 2021",2019,TV-14,90 min,"Comedies, International Movies",Four seniors embark on misadventures after bre...
86,Just Say Yes,1878,43,44,16,117,s1110,Movie,"Appie Boudellah, Aram van de Rest","Yolanthe Cabau, Noortje Herlaar, Kim-Lian van ...",Netherlands,"April 2, 2021",2021,TV-MA,98 min,"Comedies, International Movies, Romantic Movies",Incurable romantic Lotte finds her life upende...
87,Ava,186,4,47,29,6,s1585,Movie,Tate Taylor,"Jessica Chastain, Colin Farrell, John Malkovic...",United States,"December 7, 2020",2020,R,97 min,"Action & Adventure, Dramas",An elite assassin wrestling with doubts about ...
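
The default inner merge silently drops ranked titles that have no exact match in the Kaggle catalog, and a join on `title` alone can pair a ranked entry with a different work of the same name. A hedged sketch of checking both cases, using made-up miniature tables (`ranked`, `catalog`):

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
ranked = pd.DataFrame({"title": ["Title A", "Title B"], "point": ["200", "100"]})
catalog = pd.DataFrame({
    "title": ["Title A", "Same Name", "Same Name"],
    "release_year": [2021, 2019, 2008],
})

# how="left" keeps every ranked title; indicator=True adds a _merge column
# that shows whether a match was found in the catalog.
checked = pd.merge(ranked, catalog, on="title", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "title"].tolist()

# Titles that occur more than once in the catalog would duplicate ranked rows.
duplicated_titles = catalog.loc[catalog["title"].duplicated(), "title"].unique().tolist()

print(unmatched)          # ['Title B']
print(duplicated_titles)  # ['Same Name']
```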


In [33]:
new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 89 entries, 0 to 88
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   title            89 non-null     object
 1   point            89 non-null     object
 2   countries        89 non-null     object
 3   point/countries  89 non-null     object
 4   days             89 non-null     object
 5   point/days       89 non-null     object
 6   show_id          89 non-null     object
 7   type             89 non-null     object
 8   director         88 non-null     object
 9   cast             88 non-null     object
 10  country          65 non-null     object
 11  date_added       89 non-null     object
 12  release_year     89 non-null     int64 
 13  rating           89 non-null     object
 14  duration         89 non-null     object
 15  listed_in        89 non-null     object
 16  description      89 non-null     object
dtypes: int64(1), object(16)
memory usage:

In [34]:
netflix.loc[netflix["type"] == "TV Show"].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2676 entries, 1 to 8803
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       2676 non-null   object
 1   type          2676 non-null   object
 2   title         2676 non-null   object
 3   director      230 non-null    object
 4   cast          2326 non-null   object
 5   country       2285 non-null   object
 6   date_added    2666 non-null   object
 7   release_year  2676 non-null   int64 
 8   rating        2674 non-null   object
 9   duration      2676 non-null   object
 10  listed_in     2676 non-null   object
 11  description   2676 non-null   object
dtypes: int64(1), object(11)
memory usage: 271.8+ KB


In [35]:
netflix.loc[netflix["type"] == "Movie"].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6131 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       6131 non-null   object
 1   type          6131 non-null   object
 2   title         6131 non-null   object
 3   director      5943 non-null   object
 4   cast          5656 non-null   object
 5   country       5691 non-null   object
 6   date_added    6131 non-null   object
 7   release_year  6131 non-null   int64 
 8   rating        6129 non-null   object
 9   duration      6128 non-null   object
 10  listed_in     6131 non-null   object
 11  description   6131 non-null   object
dtypes: int64(1), object(11)
memory usage: 622.7+ KB
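
The two `info()` outputs show that missingness differs by type, most visibly in `director`. One way to put the shares side by side is sketched below on a made-up `sample` frame.

```python
import pandas as pd

# Hypothetical sample with the same column names as the Kaggle data.
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show"],
    "director": ["Director A", "Director B", None, "Director C"],
    "country": ["India", None, None, "South Korea"],
})

# Fraction of missing values per column, computed separately for each type.
missing_share = sample.drop(columns="type").isna().groupby(sample["type"]).mean()
print(missing_share)
```

Applied to `netflix`, the same expression would give one row per type and one column per field.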


In [36]:
ratings_ages = {'TV-PG': 'Older Kids', 'TV-MA': 'Adults', 'TV-Y7-FV': 'Older Kids',
                'TV-Y7': 'Older Kids', 'TV-14': 'Teens', 'R': 'Adults', 'TV-Y': 'Kids',
                'NR': 'Adults', 'PG-13': 'Teens', 'TV-G': 'Kids', 'PG': 'Older Kids',
                'G': 'Kids', 'UR': 'Adults', 'NC-17': 'Adults'}

# Map each rating to its age group in one step; values that are not in the
# dictionary (including missing ones) are left unchanged.
# Assigning the whole column avoids the chained-assignment warning.
netflix["rating"] = netflix["rating"].map(lambda rating: ratings_ages.get(rating, rating))


### Genre of Top Movies in 2021

In [37]:
# elements = dom.select("#netflix-1 > div.-mx-content > div > div > table > tbody > tr")

In [46]:
for element in elements:
    # select_one() returns None when the cell has no <a> tag, so skip such rows
    link = element.select("td")[2].select_one("a")
    if link is None:
        continue
    print("https://flixpatrol.com/" + link.get("href"))

In [45]:
for element in elements:
    # select_one() returns None when the cell has no <a> tag, so skip such rows
    link = element.select("td")[2].select_one("a")
    if link is None:
        continue
    url = "https://flixpatrol.com/" + link.get("href")
    response = requests.get(url)
    dom = BeautifulSoup(response.text, "html.parser")
#     country = dom.select_one("body > div.content.mt-4 > div > div.flex-grow > div.mb-6 > div.flex.flex-wrap.text-gray-500 > div > span:nth-child(3)").text
#     genre = dom.select_one("body > div.content.mt-4 > div > div.flex-grow > div.mb-6 > div.flex.flex-wrap.text-gray-500 > div > span:nth-child(9)").text
#     totalmovies["genre"] = genre
#     totalmovies["country"] = country
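
`select_one()` returns `None` when a cell contains no matching tag, which is what produces this AttributeError. A sketch of extracting detail links defensively is shown below. `detail_urls` is a hypothetical helper, `sample_html` is made-up markup, and `urljoin` is used here instead of string concatenation so that relative and absolute `href` values are both handled.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://flixpatrol.com/"


def detail_urls(html, cell_index):
    """Return absolute detail-page URLs, skipping rows without a link."""
    dom = BeautifulSoup(html, "html.parser")
    urls = []
    for row in dom.select("table tbody tr"):
        cells = row.select("td")
        if len(cells) <= cell_index:
            continue
        link = cells[cell_index].select_one("a")
        if link is None or not link.get("href"):
            continue  # no <a> tag (or no href) in this cell
        urls.append(urljoin(BASE, link["href"]))
    return urls


# Made-up rows: the second row has no link in its title cell.
sample_html = """
<table><tbody>
<tr><td>1.</td><td><a href="/title/example-movie/">Example Movie</a></td><td>573</td></tr>
<tr><td>2.</td><td>No link here</td><td>500</td></tr>
</tbody></table>
"""
print(detail_urls(sample_html, 1))  # ['https://flixpatrol.com/title/example-movie/']
print(detail_urls(sample_html, 2))  # []
```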
