# IMDb

The IMDb website collects information about films and publishes subsets of its data for experimentation: https://www.imdb.com/interfaces/

This exercise uses a filtered [subset of the IMDb data](https://drive.google.com/file/d/1V5VNg1WTMTS_eEOQqylektM9ZwdG1uq2/view?usp=sharing), which contains the following files:

**Cast.tsv**
- tconst (string) - alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- nconst (string) - alphanumeric unique identifier of the name/person
- category (string) - the category of job that person was in
- job (string) - the specific job title if applicable, else '\N'
- characters (string) - the name of the character played if applicable, else '\N'

**Title.tsv**
- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- isAdult (boolean) - 0: non-adult title; 1: adult title
- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) – TV Series end year. '\N' for all other title types
- runtimeMinutes – primary runtime of the title, in minutes
- genres (string array) – includes up to three genres associated with the title

**Star.tsv**
- nconst (string) - alphanumeric unique identifier of the name/person
- primaryName (string) – name by which the person is most often credited
- birthYear – in YYYY format
- deathYear – in YYYY format if applicable, else '\N'
- primaryProfession (array of strings) – the top-3 professions of the person
- knownForTitles (array of tconsts) – titles the person is known for

## Preprocess

Use encoding='utf-8' when reading the files.

In [226]:
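def read_table(file):
    """Read a TSV file and return (header fields, list of row fields)."""
    with open(file, encoding="utf-8") as fp:
        data = fp.readlines()
    # Keep every second line (data[::2]), as in the original cell; this
    # presumably skips filler lines that sit between records in these files.
    data = [i.strip().split("\t") for i in data[::2]]
    datahead = data[0]    # first kept line holds the column names
    datavalue = data[1:]  # remaining kept lines are the records
    return datahead, datavalue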

In [227]:
df_star_head,df_star = read_table('Star.tsv')
df_cast_head,df_cast = read_table('Cast.tsv')
df_title_head,df_title = read_table('Title.tsv')
com_star  = [dict(zip(df_star_head,i)) for i in df_star]
com_cast  = [dict(zip(df_cast_head,i)) for i in df_cast]
com_title = [dict(zip(df_title_head,i)) for i in df_title]
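
As an alternative to splitting lines by hand, the standard-library `csv` module can build the same kind of list of dictionaries. This is only a sketch, not the method used in this notebook: `read_table_csv` is a hypothetical name and the sample text is made up. Unlike `read_table`, it reads every line rather than every second one, so it assumes a file without filler lines.

```python
import csv
import io

def read_table_csv(fp):
    """Read a tab-separated table from an open text file into a list of dicts."""
    # QUOTE_NONE keeps quote characters inside titles or names as literal text.
    reader = csv.DictReader(fp, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

# Made-up sample in the shape of Star.tsv (not real data).
sample = (
    "nconst\tprimaryName\tbirthYear\n"
    "nm0000001\tPerson A\t1980\n"
    "nm0000002\tPerson B\t1975\n"
)
rows = read_table_csv(io.StringIO(sample))
print(rows[0]["primaryName"])
```

With a real file, the same function could be called as `read_table_csv(open('Star.tsv', encoding='utf-8', newline=''))`.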

# 1. How many genres are there, and how many titles does each genre have (sorted by genre)
Use Title.

```
Action 733
Adult 56
Adventure 277
Animation 132
Biography 178
Comedy 1694
Crime 488
Documentary 484
Drama 3041
Family 247
Fantasy 218
History 120
Horror 559
Music 128
Musical 74
Mystery 268
News 7
Reality-TV 3
Romance 825
Sci-Fi 181
Sport 80
Thriller 726
War 83
Western 22
\N 270
```

In [228]:
genre = {}
for i in com_title:
    # genres is a comma-separated string; count each genre separately
    for j in i["genres"].split(","):
        genre[j] = genre.get(j, 0) + 1
# sorted() orders the (genre, count) pairs by genre name
for i in sorted(genre.items()):
    print(*i)

Action 733
Adult 56
Adventure 277
Animation 132
Biography 178
Comedy 1694
Crime 488
Documentary 484
Drama 3041
Family 247
Fantasy 218
History 120
Horror 559
Music 128
Musical 74
Mystery 268
News 7
Reality-TV 3
Romance 825
Sci-Fi 181
Sport 80
Thriller 726
War 83
Western 22
\N 270
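
The same tally can be written with `collections.Counter`, which removes the need to initialise each key. The sketch below runs on a few made-up rows shaped like `com_title`; the name `genre_counts` is an assumption for illustration.

```python
from collections import Counter

# Made-up rows in the shape of com_title (not real data).
titles = [
    {"tconst": "tt1", "genres": "Action,Comedy"},
    {"tconst": "tt2", "genres": "Comedy"},
    {"tconst": "tt3", "genres": "\\N"},
]

# A title with several genres contributes one count to each of them.
genre_counts = Counter(g for t in titles for g in t["genres"].split(","))

for name, count in sorted(genre_counts.items()):
    print(name, count)
```

The missing-value marker `\N` sorts after the capitalised genre names because a backslash has a higher code point than the uppercase letters, which is why it appears last in the listing above.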


# 2. How many actors and how many actresses are there
Count from the Star file.

```
actress 3537
actor 4300
```

In [229]:
profession_count = {}
for i in com_star:
    for j in i["primaryProfession"].split(","):
        # only the two acting professions are counted
        if j in ("actor", "actress"):
            profession_count[j] = profession_count.get(j, 0) + 1
profession_count

{'actress': 3537, 'actor': 4300}

In [230]:
for key, val in profession_count.items():
    print(f"{key:<20}{val}")

actress             3537
actor               4300


# 3. How old are the stars who are still alive, and how many people are there at each age
Use Star.

```
32 213
33 233
34 288
35 328
36 317
37 365
38 389
39 438
40 430
41 448
42 495
43 460
44 442
45 463
46 472
47 447
48 505
49 511
50 498
```

In [231]:
yrs = {}
for i in com_star:
    # a non-numeric deathYear ('\N') means the person is still alive
    if not i["deathYear"].isnumeric():
        yrs[i["birthYear"]] = yrs.get(i["birthYear"], 0) + 1
# ages are computed relative to the hard-coded year 2021
for i in sorted((2021 - int(year), count) for year, count in yrs.items()):
    print(*i)

32 213
33 233
34 288
35 328
36 317
37 365
38 389
39 438
40 430
41 448
42 495
43 460
44 442
45 463
46 472
47 447
48 505
49 511
50 498
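
The cell above hard-codes 2021 and converts every birth year with `int()`, which would raise a `ValueError` if a living person had `\N` as birth year. The sketch below takes the reference year as a parameter and skips such rows; `living_age_counts` is a hypothetical helper and the rows are made up.

```python
from collections import Counter

def living_age_counts(stars, current_year):
    """Count living people per age; rows with an unknown birth year are skipped."""
    ages = Counter()
    for s in stars:
        alive = not s["deathYear"].isdigit()  # '\N' means no recorded death
        if alive and s["birthYear"].isdigit():
            ages[current_year - int(s["birthYear"])] += 1
    return ages

# Made-up rows in the shape of com_star (not real data).
stars = [
    {"birthYear": "1980", "deathYear": "\\N"},
    {"birthYear": "1980", "deathYear": "\\N"},
    {"birthYear": "1975", "deathYear": "2010"},  # deceased, not counted
    {"birthYear": "\\N", "deathYear": "\\N"},    # unknown birth year, skipped
]
for age, count in sorted(living_age_counts(stars, 2021).items()):
    print(age, count)
```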


# 4. How many actors have appeared in an Action film
Use Cast + Title.

```
992
```

In [232]:
# tconst of every title whose genres include Action
act1 = set()
for i in com_title:
    if "Action" in i["genres"]:
        act1.add(i["tconst"])
# distinct people credited as actor or actress in one of those titles
castls = set()
for i in com_cast:
    if (i["tconst"] in act1) and (i["category"] in ("actor", "actress")):
        castls.add(i["nconst"])
print(len(castls))

992
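
The join between Title and Cast can also be expressed with two set comprehensions, which makes the "distinct people" requirement explicit. This is a sketch on made-up rows; `action_ids` and `action_actors` are illustrative names, and the genre test here compares whole genre names rather than substrings.

```python
# Made-up rows in the shape of com_title and com_cast (not real data).
titles = [
    {"tconst": "tt1", "genres": "Action,Drama"},
    {"tconst": "tt2", "genres": "Comedy"},
]
cast = [
    {"tconst": "tt1", "nconst": "nm1", "category": "actor"},
    {"tconst": "tt1", "nconst": "nm2", "category": "director"},
    {"tconst": "tt2", "nconst": "nm3", "category": "actress"},
    {"tconst": "tt1", "nconst": "nm1", "category": "actor"},  # repeated credit
]

# Step 1: ids of Action titles.
action_ids = {t["tconst"] for t in titles if "Action" in t["genres"].split(",")}

# Step 2: distinct actors and actresses credited in those titles.
action_actors = {
    c["nconst"] for c in cast
    if c["tconst"] in action_ids and c["category"] in ("actor", "actress")
}
print(len(action_actors))
```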


# 5. How many actors have appeared in films of more than one genre
Use Cast + Title.

```
3861
```

In [233]:
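def find_more_than_one(tconst):
    # True if the title with this tconst is tagged with more than one genre
    for i in com_title:
        if tconst == i["tconst"] and (len(i["genres"].split(",")) > 1):
            return True
    return False
castls = set()  # people already counted
g = 0
for i in com_cast:
    # count each actor/actress once, on their first multi-genre title
    if (i["nconst"] not in castls) and find_more_than_one(i["tconst"]) and (i["category"] in ("actor","actress")):
        castls.add(i["nconst"])
        g+=1
g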

3861
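
`find_more_than_one` scans the whole title list for every cast row. A dictionary built once from Title gives the same answer with a single lookup per row. The sketch below keeps the same reading of the question (a title tagged with more than one genre) and runs on made-up rows; `genre_count` and `multi_genre_actors` are illustrative names.

```python
# Made-up rows in the shape of com_title and com_cast (not real data).
titles = [
    {"tconst": "tt1", "genres": "Action,Drama"},
    {"tconst": "tt2", "genres": "Comedy"},
]
cast = [
    {"tconst": "tt1", "nconst": "nm1", "category": "actor"},
    {"tconst": "tt2", "nconst": "nm2", "category": "actress"},
    {"tconst": "tt9", "nconst": "nm3", "category": "actor"},  # title not in titles
]

# Build the lookup once: tconst -> number of genres on that title.
genre_count = {t["tconst"]: len(t["genres"].split(",")) for t in titles}

# Titles missing from the lookup count as zero genres and are ignored.
multi_genre_actors = {
    c["nconst"] for c in cast
    if c["category"] in ("actor", "actress") and genre_count.get(c["tconst"], 0) > 1
}
print(len(multi_genre_actors))
```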

# 6. How many directors are also actors
Use Star.

```
1424
```

In [234]:
{"actor","actress","director"}.intersection(set(com_star[1045]["primaryProfession"].split(",")))

{'actress', 'director'}

In [235]:
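cit = 0
for i in com_star:
    # substring tests on the raw field: "act" matches both "actor" and "actress";
    # any profession name that contains "director" also passes the second test
    if "act" in i["primaryProfession"] and "director" in i["primaryProfession"]:
        cit+=1
cit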

1424
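
The count above relies on substring tests against the raw `primaryProfession` string. A stricter variant splits the field first and compares whole profession names, as the exploratory cell with `intersection` does. This is a sketch only: `is_acting_director` is a hypothetical helper, `example_director` is a made-up label, and it has not been checked whether the stricter test changes the result on the real data.

```python
def is_acting_director(profession_field):
    """True if the comma-separated field lists 'director' and an acting role."""
    roles = set(profession_field.split(","))
    return "director" in roles and bool(roles & {"actor", "actress"})

print(is_acting_director("actress,director"))
print(is_acting_director("actor,example_director"))
```

The second call shows the difference: a substring test would accept it because the text contains "director", while the whole-name test does not.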

# 7. Count the rom-com films in each year (only years with at least 1 film)
Use Title.

```
1991 3
1992 2
1993 3
1994 1
1995 4
1996 8
1997 3
1998 13
1999 8
2000 8
2001 7
2002 15
2003 9
2004 8
2005 15
2006 13
2007 15
2008 9
2009 10
2010 25
2011 24
2012 18
2013 14
2014 24
2015 21
2016 22
2017 10
2018 14
2019 15
2020 5
```

In [236]:
romcom = {}
for i in com_title:
    # a rom-com carries both the Romance and the Comedy genre
    if {"Romance", "Comedy"} <= set(i["genres"].split(",")):
        romcom[i["startYear"]] = romcom.get(i["startYear"], 0) + 1
# years without any rom-com never enter the dict, so only years with >= 1 are printed
for i in sorted(romcom.items()):
    print(*i)

1991 3
1992 2
1993 3
1994 1
1995 4
1996 8
1997 3
1998 13
1999 8
2000 8
2001 7
2002 15
2003 9
2004 8
2005 15
2006 13
2007 15
2008 9
2009 10
2010 25
2011 24
2012 18
2013 14
2014 24
2015 21
2016 22
2017 10
2018 14
2019 15
2020 5
