This notebook introduces the datasets used in the hands-on exercises.

---

# 1. The Simpsons dataset

![](https://images.edrawmax.com/what-is/simpsons-family-tree/example.png) <br>
Source: https://images.edrawmax.com/what-is/simpsons-family-tree/example.png

- Download: https://www.kaggle.com/datasets/pierremegret/dialogue-lines-of-the-simpsons?resource=download
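
Once the file is downloaded, it could be loaded roughly as follows. This is a minimal sketch: the file name `simpsons_dataset.csv` and the column names `raw_character_text` and `spoken_words` are assumptions to check against the actual download, `load_dialogue` is a hypothetical helper, and the in-memory rows are made-up placeholders so the sketch runs without the file.

```python
import io

import pandas as pd

# Assumed local file name; adjust to wherever the Kaggle download was saved.
CSV_PATH = "simpsons_dataset.csv"


def load_dialogue(source):
    """Read the dialogue CSV and drop rows that have missing values."""
    lines = pd.read_csv(source)
    return lines.dropna().reset_index(drop=True)


# Tiny in-memory stand-in (made-up rows, assumed column names).
sample = io.StringIO(
    "raw_character_text,spoken_words\n"
    "Character A,First placeholder line\n"
    "Character B,\n"
)
dialogue = load_dialogue(sample)
print(dialogue.shape)
```

With the real file, `load_dialogue(CSV_PATH)` would replace the in-memory sample.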

# 2. Quora dataset

#### About the data: a collection of similar questions from Quora, a Q&A platform with a purpose similar to Naver's Knowledge iN.
#### Purpose: used in an exercise that searches for similar questions based on embeddings.
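
To make the idea of embedding-based search concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors are toy values rather than real embeddings, and `cosine_similarity` and `most_similar` are hypothetical helper names introduced only for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, top_k=1):
    """Return the top_k (question id, score) pairs closest to the query."""
    query = embeddings[query_id]
    scored = [
        (qid, cosine_similarity(query, vec))
        for qid, vec in embeddings.items()
        if qid != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings keyed by question id (values are made up).
embeddings = {
    1: [1.0, 0.0, 0.0],
    2: [0.9, 0.1, 0.0],
    3: [0.0, 1.0, 0.0],
}

nearest = most_similar(1, embeddings)
print(nearest)
```

In the actual exercise the vectors would come from an embedding model, and the known duplicate ids can serve as a reference for whether the retrieved questions are sensible.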

- Download the raw data through an API (the Hugging Face `datasets` library)

In [39]:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("quora")

In [40]:
raw_df = dataset["train"].to_pandas()

Select only the question pairs that are marked as duplicates.

In [42]:
raw_df = raw_df.loc[raw_df['is_duplicate']==True].reset_index(drop=True)

In [43]:
raw_df.loc[0, 'questions']

{'id': array([1, 2], dtype=int32),
 'text': array(['What is the step by step guide to invest in share market in india?',
        'What is the step by step guide to invest in share market?'],
       dtype=object)}

In [44]:
# Split the paired question texts and ids into separate columns
raw_df["q1"] = raw_df["questions"].apply(lambda q: q["text"][0])
raw_df["q2"] = raw_df["questions"].apply(lambda q: q["text"][1])
raw_df["id1"] = raw_df["questions"].apply(lambda q: q["id"][0])
raw_df["id2"] = raw_df["questions"].apply(lambda q: q["id"][1])

q1_to_q2 = raw_df.copy().rename(columns={"q1": "text", "id1": "id", "id2": "dq_id"}).drop(columns=["questions", "q2"])
q2_to_q1 = raw_df.copy().rename(columns={"q2": "text", "id2": "id", "id1": "dq_id"}).drop(columns=["questions", "q1"])
flat_df = pd.concat([q1_to_q2, q2_to_q1])

flat_df = flat_df.sort_values(by=['id']).reset_index(drop=True)

In [45]:
flat_df.head()

|  | is_duplicate | text | id | dq_id |
| --- | --- | --- | --- | --- |
| 0 | False | What is the step by step guide to invest in sh... | 1 | 2 |
| 1 | False | What is the step by step guide to invest in sh... | 2 | 1 |
| 2 | False | What is the story of Kohinoor (Koh-i-Noor) Dia... | 3 | 380197 |
| 3 | False | What is the story of Kohinoor (Koh-i-Noor) Dia... | 3 | 282170 |
| 4 | False | What is the story of Kohinoor (Koh-i-Noor) Dia... | 3 | 488853 |
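
The flattening step emits every pair in both directions so that each question appears under `id` with its partner under `dq_id`. The sketch below shows the same idea with plain Python; the pairs are made-up ids used only for illustration.

```python
# Made-up duplicate pairs in the form (id1, id2).
pairs = [(1, 2), (3, 4), (3, 5)]

rows = []
for id1, id2 in pairs:
    # Emit the pair in both directions so either question can be looked up by "id".
    rows.append({"id": id1, "dq_id": id2})
    rows.append({"id": id2, "dq_id": id1})

# Sort by id, mirroring the sort_values(by=["id"]) call in the notebook.
rows.sort(key=lambda row: row["id"])

for row in rows:
    print(row)
```

A question that belongs to several pairs, such as id 3 here, ends up on several rows, one per duplicate partner.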


Use only a small sample of the full data.

In [46]:
flat_df = flat_df.loc[((flat_df['id'] <= 15000) & (flat_df['dq_id'] <= 15000))]

In [47]:
# flat_df.loc[flat_df['id']==4310]
# flat_df.loc[flat_df['id'].isin([4311, 15621, 147622])]

In [48]:
# For each question, store the ids of its duplicate questions as a list
df = flat_df.drop_duplicates("id").copy()  # explicit copy avoids SettingWithCopyWarning
# df = flat_df.head(10000)
df["duplicated_questions"] = df["id"].apply(lambda qid: flat_df[flat_df["id"] == qid]["dq_id"].tolist())
df = df.drop(columns=["dq_id", "is_duplicate"])
df['length'] = [len(x) for x in df['duplicated_questions']]

df.head()


|  | text | id | duplicated_questions | length |
| --- | --- | --- | --- | --- |
| 0 | What is the step by step guide to invest in sh... | 1 | [2] | 1 |
| 1 | What is the step by step guide to invest in sh... | 2 | [1] | 1 |
| 5 | What is the story of Kohinoor (Koh-i-Noor) Dia... | 3 | [4] | 1 |
| 6 | What would happen if the Indian government sto... | 4 | [3] | 1 |
| 8 | How can I increase the speed of my internet co... | 5 | [6] | 1 |
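
The per-row `apply` in the cell above rescans `flat_df` once for every question id. As an alternative sketch (not the notebook's method), the same lists can be collected in one pass with `groupby`. The toy `flat_df` below uses made-up ids and placeholder text.

```python
import pandas as pd

# Toy stand-in for flat_df (made-up ids and placeholder text).
flat_df = pd.DataFrame(
    {
        "text": ["question a", "question a", "question b", "question c"],
        "id": [1, 1, 2, 3],
        "dq_id": [2, 3, 1, 1],
    }
)

# Collect all duplicate ids per question id in a single pass.
dup_lists = flat_df.groupby("id")["dq_id"].agg(list)

# Keep one row per question and attach its list of duplicates.
df = flat_df.drop_duplicates("id").drop(columns=["dq_id"]).copy()
df["duplicated_questions"] = df["id"].map(dup_lists)
df["length"] = df["duplicated_questions"].apply(len)

print(df)
```

The result has the same columns as the notebook's `df`, and the order of ids inside each list follows the row order of `flat_df`.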


In [28]:
df.loc[df['length']>1]

|  | text | id | duplicated_questions | length |
| --- | --- | --- | --- | --- |
| 14 | What would a Trump presidency mean for current... | 31 | [6937, 12544, 11435, 32, 1101] | 5 |
| 24 | How will a Trump presidency affect the student... | 32 | [2067, 1100, 6937, 12544, 31, 1101, 2066, 1143... | 10 |
| 46 | Why are so many Quora users posting questions ... | 37 | [12639, 1358, 4951, 1357, 6551, 38] | 6 |
| 63 | Why do people ask Quora questions which can be... | 38 | [4950, 4407, 4408, 6552, 6551, 12638, 5041, 12... | 14 |
| 126 | What is best way to make money online? | 57 | [6800, 12851, 13144, 6099, 4038, 8037, 6799, 1... | 23 |
| ... | ... | ... | ... | ... |
| 33080 | What are the safety precautions on handling sh... | 14966 | [1596, 10671, 12719, 8119, 5903, 5434, 1595, 1... | 10 |
| 33112 | How did the 2016 US election polls get it so w... | 14976 | [14977, 10435, 10434] | 3 |
| 33122 | Why were the polls so inaccurate in the 2016 e... | 14977 | [10435, 14976] | 2 |
| 33152 | What is it like to lose 30 pounds in one month? | 14992 | [9736, 9737] | 2 |


In [29]:
df.to_csv("quora_dataset.csv", index=False)

In [30]:
df.shape

(5539, 4)
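
One thing to keep in mind when reloading `quora_dataset.csv`: `to_csv` writes each list in `duplicated_questions` as text, so `read_csv` returns strings such as `"[1, 3]"` unless they are parsed back. The sketch below shows one way to do that with `ast.literal_eval`; it uses a tiny made-up frame and an in-memory buffer instead of the real file.

```python
import ast
import io

import pandas as pd

# Tiny made-up frame with a list-valued column, like duplicated_questions.
original = pd.DataFrame({"id": [1, 2], "duplicated_questions": [[2], [1, 3]]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# A plain read_csv leaves the lists as strings.
raw = pd.read_csv(io.StringIO(csv_text))

# A converter parses each string back into a Python list.
restored = pd.read_csv(
    io.StringIO(csv_text),
    converters={"duplicated_questions": ast.literal_eval},
)

print(type(raw.loc[1, "duplicated_questions"]))
print(restored.loc[1, "duplicated_questions"])
```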

# 3. ABC News dataset

![](https://s.abcnews.com/images/US/abc_news_default_2000x2000_update_16x9_992.jpg) <br>
Source: https://s.abcnews.com/images/US/abc_news_default_2000x2000_update_16x9_992.jpg

- Download: https://www.kaggle.com/datasets/therohk/million-headlines

In [50]:
df = pd.read_csv("abcnews.csv")

In [51]:
df.shape

(1244184, 2)

In [52]:
df.head()

|  | publish_date | headline_text |
| --- | --- | --- |
| 0 | 20030219 | aba decides against community broadcasting lic... |
| 1 | 20030219 | act fire witnesses must be aware of defamation |
| 2 | 20030219 | a g calls for infrastructure protection summit |
| 3 | 20030219 | air nz staff in aust strike for pay rise |
| 4 | 20030219 | air nz strike to affect australian travellers |


In [53]:
df.publish_date.max(), df.publish_date.min()

(20211231, 20030219)

In [64]:
# Keep only the headlines published in January 2020 (publish_date is an integer in YYYYMMDD form)
news_2020 = df.loc[(df['publish_date']>=20200101) & (df['publish_date']<20200201)].reset_index(drop=True)

In [65]:
news_2020

|  | publish_date | headline_text |
| --- | --- | --- |
| 0 | 20200101 | a new type of resolution for the new year |
| 1 | 20200101 | adelaide records driest year in more than a de... |
| 2 | 20200101 | adelaide riverbank catches alight after new ye... |
| 3 | 20200101 | adelaides 9pm fireworks spark blaze on riverbank |
| 4 | 20200101 | archaic legislation governing nt women propert... |
| ... | ... | ... |
| 2442 | 20200131 | who coronavirus global emergency |
| 2443 | 20200131 | who declares coronavirus outbreak as global he... |
| 2444 | 20200131 | will travel insurance cover trip cancelled ove... |
| 2445 | 20200131 | world youngest leader 33 years old offers hope... |
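
The filter above compares `publish_date` as integers, which works because the values are in YYYYMMDD form. An alternative sketch is to parse the column into real dates first, which makes month and year filters more explicit. The rows below are made-up placeholders, not actual headlines.

```python
import pandas as pd

# Made-up rows that mimic the two columns of the headlines file.
df = pd.DataFrame(
    {
        "publish_date": [20191231, 20200101, 20200131, 20200201],
        "headline_text": ["placeholder a", "placeholder b", "placeholder c", "placeholder d"],
    }
)

# Parse the integer YYYYMMDD values into datetimes.
dates = pd.to_datetime(df["publish_date"].astype(str), format="%Y%m%d")

# Select January 2020 by year and month instead of integer bounds.
january_2020 = df.loc[(dates.dt.year == 2020) & (dates.dt.month == 1)].reset_index(drop=True)

print(january_2020)
```

Both approaches select the same rows; the datetime version may be easier to adapt when filtering by weekday or by ranges that cross year boundaries.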


In [66]:
news_2020.to_csv("abcnews_2020.csv", index=False)

--END--