## Data Analysis - Circa Dataset (ParlAI - Facebook Research)

###### **Original Dataset:** [Circa - Indirect yes/no answers in dialog](https://github.com/google-research-datasets/circa)
###### **Students:** Yacine MOKHTARI & Lilia IZRI

## Part 1

### Loading the data + quick analysis

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

PATH = "./"
FILENAME = "circa-data.tsv"
# COLNAMES = ...
df_circa_raw = pd.read_csv(PATH+FILENAME, sep="\t")

In [2]:
df_circa_raw.describe(include="all")

Unnamed: 0,id,context,question-X,canquestion-X,answer-Y,judgements,goldstandard1,goldstandard2
count,34268.0,34268,34268,34258,34268,34268,31525,33497
unique,,10,3345,3259,30323,3454,8,5
top,,Y has just told X that he/she is thinking of b...,Do you have kids?,I have kids .,I would love to,Yes#Yes#Yes#Yes#Yes,Yes,Yes
freq,,3500,40,120,59,9120,14504,16628
mean,17133.5,,,,,,,
std,9892.463849,,,,,,,
min,0.0,,,,,,,
25%,8566.75,,,,,,,
50%,17133.5,,,,,,,
75%,25700.25,,,,,,,


In [3]:
df_circa_raw.head(10)

Unnamed: 0,id,context,question-X,canquestion-X,answer-Y,judgements,goldstandard1,goldstandard2
0,0,Y has just travelled from a different city to ...,Are you employed?,I am employed .,I'm a veterinary technician.,Yes#Yes#Yes#Yes#Yes,Yes,Yes
1,1,X wants to know about Y's food preferences.,Are you a fan of Korean food?,I am a fan of Korean food .,I wouldn't say so,Probably no#No#No#No#Probably yes / sometimes yes,No,No
2,2,Y has just told X that he/she is thinking of b...,Are you bringing any pets into the flat?,I am bringing pets into the flat .,I do not own any pets,No#No#No#No#No,No,No
3,3,X wants to know what activities Y likes to do ...,Would you like to get some fresh air in your f...,I would like to get fresh air in my free time .,I am desperate to get out of the city.,"Yes#Yes, subject to some conditions#Probably y...",Yes,Yes
4,4,X and Y are childhood neighbours who unexpecte...,Is your family still living in the neighborhood?,My family is living in the neighborhood .,My parents are snowbirds now.,"No#In the middle, neither yes nor no#Probably ...","In the middle, neither yes nor no","In the middle, neither yes nor no"
5,5,X wants to know what sorts of books Y likes to...,Do you like to read self-help books?,I like to read self-help books .,I'm not a fan of them,No#No#No#No#No,No,No
6,6,X wants to know about Y's food preferences.,Do you enjoy foreign cuisine?,I enjoy foreign cuisine .,I like many cuisines.,Yes#Yes#Yes#Probably yes / sometimes yes#Proba...,Yes,Yes
7,7,Y has just travelled from a different city to ...,Is your new job going well?,My new job is going well .,I love what I do.,Yes#Yes#Yes#Probably yes / sometimes yes#Yes,Yes,Yes
8,8,X wants to know what sorts of books Y likes to...,Are long books your thing?,Long books am my thing .,I rarely read any other type of book.,Yes#Yes#Yes#Yes#Yes,Yes,Yes
9,9,X wants to know about Y's food preferences.,Have you had pizza recently,I have I had pizza recently .,My husband ordered some last night.,Yes#Probably yes / sometimes yes#Yes#Yes#Yes,Yes,Yes
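
The `judgements` column packs the individual annotator labels into a single `#`-separated string. A minimal sketch of how it could be split to inspect annotator agreement, using a toy frame built from two strings copied from the rows above (the `n_distinct` and `majority` column names are hypothetical, not part of the dataset):

```python
import pandas as pd

# Toy frame: two `judgements` strings copied from rows 0 and 9 shown above.
toy = pd.DataFrame({
    "judgements": [
        "Yes#Yes#Yes#Yes#Yes",
        "Yes#Probably yes / sometimes yes#Yes#Yes#Yes",
    ]
})

# Split each string into the list of individual annotator labels.
labels = toy["judgements"].str.split("#")

# Hypothetical helper columns: number of distinct labels and the most frequent one.
toy["n_distinct"] = labels.apply(lambda xs: len(set(xs)))
toy["majority"] = labels.apply(lambda xs: max(set(xs), key=xs.count))

print(toy[["n_distinct", "majority"]])
```

Rows with `n_distinct` greater than 1 are the ones where annotators disagreed, which may be worth a look before trusting the gold labels.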


In [4]:
df_circa_raw = df_circa_raw[["question-X", "canquestion-X", "answer-Y"]]
df_circa_raw.isna().sum()

question-X        0
canquestion-X    10
answer-Y          0
dtype: int64

In [5]:
df_circa_raw[df_circa_raw.isna()["canquestion-X"].values]

Unnamed: 0,question-X,canquestion-X,answer-Y
820,While you're in town do you want to go see a g...,,I have tickets to Saturday's game.
2378,While you're in town do you want to go see a g...,,I wouldn't be a red-blooded American if I didn't.
7589,While you're in town do you want to go see a g...,,| would love that
8451,While you're in town do you want to go see a g...,,I'd rather go dancing with you.
10857,While you're in town do you want to go see a g...,,I'd rather see a baseball game.
16986,While you're in town do you want to go see a g...,,If we can get cheap tickets.
22367,While you're in town do you want to go see a g...,,Depends on who's playing.
32415,While you're in town do you want to go see a g...,,If you can get tickets.
32948,While you're in town do you want to go see a g...,,Football is a dumb sport.
33455,While you're in town do you want to go see a g...,,I prefer to watch baseball.


**Observations**
+ The 10 rows with a missing `canquestion-X` all share the same question, and their answers don't look very interesting, so I'd rather simply drop those rows!

In [6]:
df = df_circa_raw[~df_circa_raw.isna()["canquestion-X"].values]
df.isna().sum()

question-X       0
canquestion-X    0
answer-Y         0
dtype: int64
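
The same filtering can probably be written more directly with `DataFrame.dropna` and its `subset` parameter. A minimal sketch on a toy frame (the second row's question and answer are placeholders made up for illustration, not dataset content):

```python
import numpy as np
import pandas as pd

# Toy frame: the first row comes from the dataset, the second is a made-up
# placeholder standing in for a row whose canonical form is missing.
toy = pd.DataFrame({
    "question-X": ["Are you employed?", "Placeholder question?"],
    "canquestion-X": ["I am employed .", np.nan],
    "answer-Y": ["I'm a veterinary technician.", "Placeholder answer."],
})

# Drop only the rows where `canquestion-X` is missing.
clean = toy.dropna(subset=["canquestion-X"])

print(len(toy), "->", len(clean))
```

Restricting `subset` to one column keeps rows that have missing values elsewhere, which matches the boolean-mask version used in the cell above.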

### Q&A extraction

In [7]:
df["answer-Y"]

0                  I'm a veterinary technician.
1                             I wouldn't say so
2                         I do not own any pets
3        I am desperate to get out of the city.
4                 My parents are snowbirds now.
                          ...                  
34263                               I am in AA.
34264                 My favorite pie is pecan.
34265             I'd rather do something else.
34266            I can't dance to hip/hop music
34267               I will never have a family.
Name: answer-Y, Length: 34258, dtype: object

In [8]:
df["question-X"]

0                                        Are you employed?
1                            Are you a fan of Korean food?
2                 Are you bringing any pets into the flat?
3        Would you like to get some fresh air in your f...
4         Is your family still living in the neighborhood?
                               ...                        
34263                                Do you like to drink?
34264                                     Do you like pie?
34265                     Want to go to a concert with me?
34266                           Do you like hip/hop music?
34267    Do you see yourself raising a family in New York?
Name: question-X, Length: 34258, dtype: object

In [9]:
def extractQA(df):
    """
    NOTES:
        I only extract question/answer pairs of the form:
            - (question-X --> answer-Y)    /// the natural extraction
            - (canquestion-X --> answer-Y) /// we also want an answer to the declarative version of the user's input
        The 'Yes/No/...' labels are not taken into account.
    """
    questions = list(df["question-X"].values) + list(df["canquestion-X"].values)
    answers = list(df["answer-Y"].values) * 2
    return questions, answers

# extraction
questions, answers = extractQA(df)
assert len(questions) == len(answers)
print("Total number of Q&A pairs:", len(questions))

Total number of Q&A pairs: 68516
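
The stacking done by `extractQA` can also be expressed with `pd.concat`, which keeps each question next to its answer at every step. A minimal sketch on two rows copied from the dataset preview (the names `direct`, `canonical`, and `stacked` are illustrative only):

```python
import pandas as pd

# Toy frame: rows 0 and 1 from the dataset preview.
toy = pd.DataFrame({
    "question-X": ["Are you employed?", "Are you a fan of Korean food?"],
    "canquestion-X": ["I am employed .", "I am a fan of Korean food ."],
    "answer-Y": ["I'm a veterinary technician.", "I wouldn't say so"],
})

# One frame per question variant, both renamed to the same two columns.
direct = toy[["question-X", "answer-Y"]].rename(
    columns={"question-X": "Question", "answer-Y": "Answer"}
)
canonical = toy[["canquestion-X", "answer-Y"]].rename(
    columns={"canquestion-X": "Question", "answer-Y": "Answer"}
)

# Stack them: all direct questions first, then all canonical forms.
stacked = pd.concat([direct, canonical], ignore_index=True)

print(stacked)
```

The row order matches the list-based version: the original questions first, followed by their canonical forms, with the answers repeated.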


In [10]:
df_qa = pd.DataFrame(data =  { "Question" : questions,  "Answer" : answers})
df_qa.head(5)

Unnamed: 0,Question,Answer
0,Are you employed?,I'm a veterinary technician.
1,Are you a fan of Korean food?,I wouldn't say so
2,Are you bringing any pets into the flat?,I do not own any pets
3,Would you like to get some fresh air in your f...,I am desperate to get out of the city.
4,Is your family still living in the neighborhood?,My parents are snowbirds now.


### Unit tests
<small><i>Even though I should have done this earlier 🥲</i></small>

In [11]:
print("Number of duplicated (Q, A) pairs:", df_qa.duplicated().sum())

Number of duplicated (Q, A) pairs: 423


In [12]:
df_qa[df_qa.duplicated()]

Unnamed: 0,Question,Answer
1133,Do you have a dog?,I have two.
2209,Do you like to eat out?,I love to eat out
2547,Are you interested in trap music?,I don't know what that is
3020,Would you like to go the pub sometime?,I don't drink.
4408,are you married?,I'm divorced.
...,...,...
68078,I work near by .,I work from home.
68097,I have kids .,I wish I had kids.
68104,I want large portions .,I'm trying to lose weight.
68285,I am busy tonight .,I don't have any plans.


+ Skimming through the duplicated data, I get the feeling there will be a lot to write by hand to personalize our chatbot ahaha 😭

In [13]:
# Remove the duplicates
df_qa = df_qa[~df_qa.duplicated()]
print("Number of duplicates:", df_qa.duplicated().sum())
print("Dataset size:", len(df_qa))

Number of duplicates: 0
Dataset size: 68093
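
`duplicated()` only flags exact matches, so pairs that differ by case or surrounding whitespace survive. A minimal sketch, assuming such near-duplicates are unwanted, that compares exact deduplication with deduplication on a normalized key. The lowercase variant in the toy frame and the `q_norm`/`a_norm` column names are hypothetical:

```python
import pandas as pd

# Toy frame: one pair from the duplicates shown above, repeated exactly once,
# plus a hypothetical variant that differs only by letter case.
toy = pd.DataFrame({
    "Question": ["Do you have a dog?", "Do you have a dog?", "do you have a dog?"],
    "Answer": ["I have two.", "I have two.", "I have two."],
})

# Exact deduplication: keeps the first occurrence of each identical pair.
exact = toy.drop_duplicates()

# Normalized deduplication: compare lowercased, stripped text instead.
normalized = (
    toy.assign(
        q_norm=toy["Question"].str.lower().str.strip(),
        a_norm=toy["Answer"].str.lower().str.strip(),
    )
    .drop_duplicates(subset=["q_norm", "a_norm"])
    .drop(columns=["q_norm", "a_norm"])
)

print(len(toy), len(exact), len(normalized))
```

Whether case variants should count as duplicates depends on how the chatbot matches user input, so this is a design choice rather than a required step.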


### Save the resulting dataset!

In [14]:
OUTPUT_PATH = "./output/"
OUTPUT_NAME = "circa-main_RAW_EXTRACTION_QA.tsv"

df_qa.to_csv(OUTPUT_PATH + OUTPUT_NAME, sep="\t", index=False)
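
Two details can matter when saving and reloading this file. `to_csv` does not create the output directory, and `read_csv` turns some literal strings (for example `N/A`) into missing values by default. A minimal sketch, written against a temporary directory and a toy frame whose second answer is a made-up `N/A`, not dataset content:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame: the second answer is a hypothetical literal "N/A" string.
toy = pd.DataFrame({
    "Question": ["Do you like pie?", "Do you like hip/hop music?"],
    "Answer": ["My favorite pie is pecan.", "N/A"],
})

with tempfile.TemporaryDirectory() as tmp:
    # Create the output directory before writing (no error if it already exists).
    out_dir = Path(tmp) / "output"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "toy_qa.tsv"

    toy.to_csv(out_file, sep="\t", index=False)

    # Default parsing converts "N/A" to NaN; keep_default_na=False keeps the text.
    default_read = pd.read_csv(out_file, sep="\t")
    literal_read = pd.read_csv(out_file, sep="\t", keep_default_na=False)

n_missing_default = int(default_read["Answer"].isna().sum())
n_missing_literal = int(literal_read["Answer"].isna().sum())
print(n_missing_default, n_missing_literal)
```

A missing value in `Answer` or `Question` would later break string methods such as `.lower()`, so checking `isna().sum()` right after reloading is a cheap safeguard.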

### Next step
* Go through the dataset:
    * Remove the (Question, Answer) pairs that are meaningless or useless for us
    * Edit some answers when the questions are interesting
* Translate into French (informal/friendly: "tu" rather than "vous")

## Part 2

### Filtering on inappropriate "words"

In [15]:
import pandas as pd
import numpy as np
from tqdm import tqdm
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt


In [19]:
# read_csv : 
PATH = "./output/"
FILENAME = "circa-main_RAW_EXTRACTION_QA.tsv"
df_qa = pd.read_csv(PATH+FILENAME, sep="\t")
df_qa.head(10)


Unnamed: 0,Question,Answer
0,Are you employed?,I'm a veterinary technician.
1,Are you a fan of Korean food?,I wouldn't say so
2,Are you bringing any pets into the flat?,I do not own any pets
3,Would you like to get some fresh air in your f...,I am desperate to get out of the city.
4,Is your family still living in the neighborhood?,My parents are snowbirds now.
5,Do you like to read self-help books?,I'm not a fan of them
6,Do you enjoy foreign cuisine?,I like many cuisines.
7,Is your new job going well?,I love what I do.
8,Are long books your thing?,I rarely read any other type of book.
9,Have you had pizza recently,My husband ordered some last night.


In [20]:
"UeuE".lower()

'ueue'

In [23]:

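# intended for both questions and answers (pass look_questions=True)
qa_words_filter = ["god", "racism", "racist", "xenophobia", "sexism", "sexist"]
# looked for in answers only
a_words_filter = ["husband", "wife", "boyfriend", "girlfriend", "daughter", "son"]

def word_detection(df, list_words, look_questions=False):
    """Return the positions of rows whose answer (or, optionally, question) contains one of the words."""
    idx_list = []
    for i in range(len(df)):
        tmp = df.iloc[i]
        # str() guards against values read back as NaN by read_csv
        answer = str(tmp["Answer"]).lower()
        question = str(tmp["Question"]).lower()
        for word in list_words:
            # substring match, so "son" also matches "person"
            if (word in answer) or (look_questions and word in question):
                idx_list.append(i)
                break  # record each row at most once

    return idx_list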

l1 = word_detection(df_qa, a_words_filter)

AttributeError: 'str' object has no attribute 'values'
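
The traceback mentions `.values` on a `str`, which the cell as shown never calls, so it presumably comes from an earlier version of the cell. Separately, a row-by-row substring test is slow on about 68,000 rows and matches inside longer words. A minimal sketch of a vectorized, whole-word alternative using `Series.str.contains` with a regex, on a toy frame whose last answer is made up to show the difference:

```python
import re

import pandas as pd

# Toy frame: the first two rows come from the dataset preview, the third
# answer is a hypothetical sentence containing "son" inside "person".
toy = pd.DataFrame({
    "Question": ["Have you had pizza recently", "Are you employed?", "Do you like pie?"],
    "Answer": [
        "My husband ordered some last night.",
        "I'm a veterinary technician.",
        "I am a pie person.",
    ],
})

a_words_filter = ["husband", "wife", "boyfriend", "girlfriend", "daughter", "son"]

# Whole-word pattern: \b marks word boundaries, re.escape protects special characters.
pattern = r"\b(?:" + "|".join(re.escape(w) for w in a_words_filter) + r")\b"

# na=False treats missing answers as non-matching instead of raising or returning NaN.
mask = toy["Answer"].str.contains(pattern, case=False, regex=True, na=False)
matches = toy.index[mask].tolist()

print(matches)
```

Whole-word matching skips "person" but also misses plurals such as "sons", so the word list may need to include those forms explicitly.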