## Data description

Вам выдается 4 файла:

* relevance_train.csv -- обучающая выборка пар запрос-документ и асессорские метки релевантности (все документы имеют одинаковую релевантность, т.е. можно считать, что выданы просто релевантные документы).
* relevance_test.csv -- тестовая выборка (только номера запросов, для которых нужно найти релевантные документы)
* queries.csv -- запросы из relevance_test и relevance_train (в формате id_запроса, текст запроса)
* documents.csv -- документы из relevance_test и relevance_train (началу документа соответствует строка "TEXT $n$", где $n$ - это id данного документа)  <-- выяснилось, что это неправильное описание. Как правильно: оказывается, DocumentId - это порядковый номер текста в documents.csv И TEXT \$n\$ тут совсем не причем.


Колонки могут быть следующего типа:

* QueryId -- уникальный номер запроса
* DocumentId -- номер документа, не повторяется для одного запроса
* Relevance -- асессорская метка релевантности


Формат файла ответов приведен ниже. Пары запрос-документ должны соответствовать файлу relevance_test.csv и должны быть упорядочены по убыванию построенной функции релевантности. То есть так, как в поисковой выдаче.

QueryId,DocumentId

101,5

101,0

101,9

101,13

101,17

## Считывание и обработка данных

In [1]:
import pandas as pd
import numpy as np
import csv
import re
from xgboost import XGBClassifier
import nltk
from nltk.tokenize import word_tokenize 
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from collections import Counter

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/irinadmitrieva/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/irinadmitrieva/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
relevance_train = pd.read_csv('relevance_train.csv', sep=',')
relevance_train.head()

Unnamed: 0,QueryId,DocumentId,Relevance
0,1,268,1
1,1,288,1
2,1,304,1
3,1,308,1
4,1,323,1


In [3]:
relevance_train.shape

(127, 3)

In [4]:
relevance_test = pd.read_csv('relevance_test.csv', sep=',')
relevance_test.head()

Unnamed: 0,QueryId
0,2
1,3
2,4
3,5
4,6


In [5]:
relevance_test.shape

(42, 1)

In [6]:
! echo "$(cat queries.csv)"

1,KENNEDY ADMINISTRATION PRESSURE ON NGO DINH DIEM TO STOP SUPPRESSING THE BUDDHISTS .
2,EFFORTS OF AMBASSADOR HENRY CABOT LODGE TO GET VIET NAM'S PRESIDENT DIEM TO CHANGE HIS POLICIES OF POLITICAL REPRESSION .
3,NUMBER OF TROOPS THE UNITED STATES HAS STATIONED IN SOUTH VIET NAM AS COMPARED WITH THE NUMBER OF TROOPS IT HAS STATIONED IN WEST GERMANY .
4,U.S . POLICY TOWARD THE NEW REGIME IN SOUTH VIET NAM WHICH OVER THREW PRESIDENT DIEM .
5,PERSONS INVOLVED IN THE VIET NAM COUP .
6,CEREMONIAL SUICIDES COMMITTED BY SOME BUDDHIST MONKS IN SOUTH VIET NAMAND WHAT THEY ARE SEEKING TO GAIN BY SUCH ACTS .
7,REJECTION BY PRINCE NORODOM SIHANOUK, AN ASIAN NEUTRALIST LEADER,OF ALL FURTHER U.S . AID TO HIS NATION .
8,U.N . TEAM SURVEY OF PUBLIC OPINION IN NORTH BORNEO AND SARAWAK ON THE QUESTION OF JOINING THE FEDERATION OF MALAYSIA .
9,OPPOSITION OF INDONESIA TO THE NEWLY-CREATED MALAYSIA .
10,GROWING CONTROVERSY IN SOUTHEAST ASIA OVER THE PROPOSED CREATION OF A FEDERATION OF MA

In [7]:
! echo "$(cat documents.csv)"

*TEXT 17
THE ALLIES AFTER NASSAU IN DECEMBER 1960, THE U.S . FIRST
PROPOSED TO HELP NATO DEVELOP ITS OWN NUCLEAR STRIKE FORCE . BUT EUROPE
MADE NO ATTEMPT TO DEVISE A PLAN . LAST WEEK, AS THEY STUDIED THE
NASSAU ACCORD BETWEEN PRESIDENT KENNEDY AND PRIME MINISTER MACMILLAN,
EUROPEANS SAW EMERGING THE FIRST OUTLINES OF THE NUCLEAR NATO THAT THE
U.S . WANTS AND WILL SUPPORT . IT ALL SPRANG FROM THE ANGLO-U.S .
CRISIS OVER CANCELLATION OF THE BUG-RIDDEN SKYBOLT MISSILE, AND THE
U.S . OFFER TO SUPPLY BRITAIN AND FRANCE WITH THE PROVED POLARIS (TIME,
DEC . 28) . THE ONE ALLIED LEADER WHO UNRESERVEDLY WELCOMED THE POLARIS
OFFER WAS HAROLD MACMILLAN, WHO BY THUS KEEPING A SEPARATE NUCLEAR
DETERRENT FOR BRITAIN HAD SAVED HIS OWN NECK . BACK FROM NASSAU, THE
PRIME MINISTER BEAMED THAT BRITAIN NOW HAD A WEAPON THAT " WILL LAST A
GENERATION . THE TERMS ARE VERY GOOD . " MANY OTHER BRITONS WERE NOT SO
SURE . THOUGH THE GOVERNMENT WILL SHOULDER NONE OF THE $800 MILLION

A SPOKESMAN) . BUT DE GAULLE HAS YET TO SHOW THAT HE CAN REMAKE
EUROPE IN FRANCE'S IMAGE . BRITAIN AND THE U.S . ARE BETTING THAT HE
NEVER WILL . INDEED, THE UNANIMOUS OUTCRY FROM THE OTHER FIVE COMMON
MARKET NATIONS LAST WEEK WAS AMPLE TESTIMONY THAT THEY ARE CURRENTLY
DETERMINED TO RESIST FRANCE'S GRASP FOR LEADERSHIP . LAST WEEK THE
FIVE ABRUPTLY FORCED POSTPONEMENT OF A SPECIAL MEETING OF THE COMMON
MARKET FINANCE MINISTERS CALLED BY THE FRENCH TO INVESTIGATE THE
EXTENT OF U.S . INVESTMENT IN EUROPE, A PET HATE OF DE GAULLE'S THESE
DAYS . THIS REBUFF DISTURBED FRENCH MINISTER OF FINANCE VALERY GISCARD
D'ESTAING NOT AT ALL . HE REPORTEDLY REMARKED THAT THE FIVE HAD MERELY
DEPRIVED HIM OF HIS DESSERT MEANING, PRESUMABLY, THAT FRANCE HAD
ALREADY ENJOYED THE BRITISH LION FOR BREAKFAST . THE FIVE HAVE ALSO
SHOWN SIGNS OF BLOCKING OTHER FRENCH PROJECTS, E.G ., ALGERIA'S
ASSOCIATION WITH THE SIX . SHARP DIVISIONS . BUT AS THE INITIAL AND
GENUINE ANGER DIMINISHE

FLAT AS TEXAS AND MORE THAN TWICE ITS SIZE, AND AN ECONOMY BASED ON THE
OIL OF IRAQ, THE AGRICULTURE OF SYRIA, AN THE INDUSTRY AND COTTON OF
EGYPT . THE AGREEMENT CALLS FOR A SINGLE POLITICAL HEAD (ALMOST CERTAIN
TO BE NASSER) AND A CENTRAL PARLIAMENT BASED ON POPULATION, WHICH WOULD
GIVE EGYPT A TWO-THIRDS MAJORITY . THIS CENTRAL STATE WOULD BE
RESPONSIBLE FOR 1) DEFENSE AND FOREIGN POLICY, 2) A SOCIALIST ECONOMIC
FRAMEWORK, AND 3) UNIFIED EDUCATIONAL AND CULTURAL PROGRAMS . BUT
WITHIN THE UNION, EACH STATE WOULD HAVE ITS OWN ELECTED POPULAR
AUTHORITY AND ITS OWN PARLIAMENT . NOT REPRESENTED IN THE CAIRO TALKS
WAS PRIMITIVE YEMEN, WHOSE BOSS, ABDULLAH SALLAL, IS PROPPED UP BY
20,000 EGYPTIAN SOLDIERS, BUT SALLAL CABLED CAIRO ANNOUNCING HIS TOTAL
ADHERENCE TO WHATEVER IS DECIDED . BUT AT WEEK'S END, THE REPORTED "
CLOSE AGREEMENT " HAD APPARENTLY RUN INTO A SNAG . THE THREEPOWER TALKS
UNEXPECTEDLY BROKE UP AND, ACCORDING TO A COMMUNIQUE, WILL RESUME " IN
A F

WITH FRANCE'S PRESIDENT FOR ABOUT 35 MINUTES, COUNTING TIME OUT FOR
TRANSLATION . BRITAIN'S FOREIGN SECRETARY LORD HOME AND FRANCE'S
FOREIGN MINISTER MAURICE COUVE DE MURVILLE, WHO HAVE BEEN SNUBBING ONE
ANOTHER EVER SINCE BRITAIN WAS EXCLUDED FROM THE COMMON MARKET LAST
JANUARY, ALSO EXCHANGED CIVILITIES . WHAT'S MORE, AT AN ELYSEE PALACE
RECEPTION FOR SEATO DELEGATES FROM EIGHT COUNTRIES, LE GRAND CHARLES
AFFABLY DECLARED : " SECURITY MEANS COOPERATION . " SUBS, PLEASE . WAS
THIS PROGRESS ? SKEPTICS NOTED THAT FRANCE'S GOVERNMENT SEEMED MORE
INCLINED TO TALK COOPERATION THAN TO PRACTICE IT . AFTER INDICATING
THAT THEY MIGHT HONOR A 15-MONTH-OLD AGREEMENT TO ACCEPT
U.S.-CONTROLLED WARHEADS FOR ITS GERMAN-BASED F-100 FIGHTERBOMBERS, THE
FRENCH BRUSQUELY DENIED THAT THEY HAD ANY PRESENT PLANS " CONCERNING
THE USE OF THESE PLANES WITHIN NATO . " NOTHING DAUNTED, U.S .
OFFICIALS IN PARIS LEAKED WISHFUL REPORTS THAT FRANCE'S NUCLEAR FORCE
DE FRAPPE IS BADLY BEHI

IT WOULD TAKE THE VOPOS THREE SECONDS TO DRAW THEIR WEAPONS ONCE THEY
REALIZED WHAT I WAS DOING . BUT I THOUGHT I COULD MAKE IT IN THOSE
THREE SECONDS . BESIDES, WE HAD 30 BRICKS BEHIND MRS . THURAU TO
PROTECT HER IF FIRING STARTED . /
*TEXT 265
GREAT BRITAIN YER PAYS YER MONEY, YER TYKES YER CHOICE
PSEPHOLOGY, AS GUESSING ELECTIONS IS CALLED IN BRITAIN, IS ABOUT AS
INEXACT AN ART AS PLAYING THE FOOTBALL POOLS . FACED WITH A GENERAL
ELECTION THIS YEAR OR NEXT, THE EXPERTS LAST WEEK STUDIED A RICH CROP
OF AUGURIES WITH UNUSUAL DILIGENCE AND THE USUAL RESULTS : THEY
DISAGREED . CERTAINLY, THERE WAS LITTLE TO ENCOURAGE PRIME MINISTER
HAROLD MACMILLAN'S CONSERVATIVES IN THE OUTCOME OF 401 LOCAL BOROUGH
ELECTIONS . WITH 2,973 SEATS AT STAKE, THE TORIES LOST A TOTAL OF 550
; THE LABOR PARTY GAINED 544, WINNING CONTROL OF LOCAL GOVERNMENTS IN
SUCH MAJOR CITIES AS LEICESTER, LIVERPOOL, BRADFORD, BRISTOL AND
NOTTINGHAM . LABOR OFFICIALS CLAIMED THAT IF A GENERAL EL

OF WHICH HAVE ALWAYS BEEN ACCORDED A MEASURE OF BILINGUAL STATUS . AT
LOUVAIN UNIVERSITY, WALLOON PROFESSORS AND STUDENTS WENT OUT ON STRIKE,
BOYCOTTING LECTURES AND CLASSES IN PROTEST OVER THE PROPOSED BILINGUAL
SPLIT OF THE TRADITIONALLY FRENCH-SPEAKING UNIVERSITY . IN AN ATTEMPT
TO SOLVE THE EXPLOSIVE SITUATION, PREMIER LEFEVRE CALLED FOR A TWO-WEEK
/ LANGUAGE ARMISTICE " WHILE HIS GOVERNMENT TRIED TO WORK OUT ANOTHER
COMPROMISE . BUT WITH PARLIAMENT SPLIT ALONG LANGUAGE LINES, THERE WAS
LITTLE HOPE OF A SOLUTION . UNLESS FLEMINGS AND WALLOONS LEARN TO LIVE
WITH EACH OTHER, SAID ONE DEPUTY IN PARLIAMENT LAST WEEK, " I CAN ONLY
CONCLUDE THAT IN A SMALL COUNTRY LIVE SMALL PEOPLE . /
*TEXT 313
SOUTH VIET NAM THE MAKESHIFT KILLERS INTO THE
VALLEY OF DEATH LAST WEEK FLEW THE 800 . THEY WERE SOUTH VIETNAMESE
TROOPS BEING LIFTED BY A COMPANY OF U.S . H-21 TROOP-CARRYING
HELICOPTERS TO CLEAN OUT A COMMUNIST-INFESTED JUNGLE HIDEOUT 175 MILES
NORTHEAST OF SAIGON 

TO HIM LAUGH . " ADDED CLARE BOOTHE LUCE : " WE MUST STOP TRYING TO
MAKE PAPER DOLLS OF OUR WOMEN . " ANTHROPOLOGIST MARGARET MEAD WARNED
: " THE RUSSIANS TREAT MEN AND WOMEN INTERCHANGEABLY . WE TREAT MEN
AND WOMEN DIFFERENTLY . " AND VIVE LA DIFFERENCE, SAID AT LEAST ONE
RUSSIAN MALE LAST WEEK . " MY AGE AND CONSERVATIVE MENTAL MAKEUP
COMPELLED ME TO THINK UP TO THE LAST FEW DAYS THAT WE MEN WERE THE
RULERS OF MAN'S MIND AND THE SALT OF THE EARTH, " SAID NOVELIST MIKHAIL
(AND QUIET FLOWS THE DON) SHOLOKHOV . " AND WHAT DO WE SEE NOW ? A
WOMAN IN SPACE ! SAY WHAT YOU WILL, THIS IS INCOMPREHENSIBLE . IT
CONTRADICTS ALL MY SET CONCEPTIONS OF THE WORLD AND ITS POSSIBILITIES .
MANY SCIENTISTS
DOUBT ASTRONAUT GORDON COOPER'S REPORT OF SEEING TRUCKS ON THE ROAD
AND SMOKE COMING OUT OF CHIMNEYS IN TIBET . ACCORDING TO DR . W . R .
ADEY OF U.C.L.A., THIS IS EQUIVALENT TO SEEING OBJECTS 1 IN . IN
DIAMETER 4,000 FT . AWAY . HE THINKS COOPER HAD DISORDERS OF VISION OR

A MORTAL TERROR OF INSECURITY AND ENCOURAGES THEM TO STAY CELIBATE . AT
LEAST SOME OF THESE FACTORS ARE CHANGING, AND THE RELATIONS BETWEEN THE
SEXES SEEM LESS SELF-CONSCIOUS AND AT TIMES DOWNRIGHT FRIENDLY . THE
IRISH NOW WED YOUNGER ; THE AVERAGE MARRIAGE AGE DROPPED FROM NEARLY
35 FOR MEN IN 1929 TO JUST OVER 30, FROM 29 FOR WOMEN TO JUST UNDER 27
. FOR THE YOUNG, ONE OF THE MOST JOYOUS INNOVATIONS IN RECENT YEARS HAS
BEEN A PROLIFERATION OF DANCE HALLS, WHICH HAVE REACHED SCORES OF SMALL
COMMUNITIES, AND A BURGEONING OF BANDS " 200 IN ALL THAT KEEP THE OULD
SOD JUMPING WITH HIPPETYHOPPETY JAZZ AND CARRY SUCH INTRIGUING NAMES AS
REBELS, JETS, MONARCHS . UNLIKE THE OLD DAYS, WHEN THE LOCAL PRIEST
WOULD OFTEN DISPERSE A COUNTRY CEILIDH AT SUNDOWN, DANCE-HALL HOURS ARE
REGULATED BY MAGISTRATES, WHO TEND TO BE MORE LIBERAL . AND, AS ALWAYS,
THE WESTERN WORLD HAS ITS PLAYBOYS . IN LONDON, WHERE ONE IN EIGHT
BIRTHS IS ILLEGITIMATE, AUTHORITIES REPORT THAT A DISPR

ANARCHIC EX-BELGIAN NEIGHBOR, HAS LONG SEEMED QUIET AND PEACEFUL . BUT
WHEN IT CAME, YOULOU'S EXIT HAD ALL THE REVOLUTIONARY TRIMMINGS,
INCLUDING A STORMING OF THE LOCAL BASTILLE AND A MOB OUTSIDE THE PALACE
HOWLING FOR BREAD .     NNNN
HABITUALLY CLAD IN A CASSOCK OFTEN TOPPED BY A HOMBURG, AND SAID TO
HAVE CARRIED A PISTOL IN HIS ROBES, YOULOU AT 46 WAS ONE OF THE WORLD'S
MOST UNUSUAL STATESMEN . A MEMBER OF THE LARI TRIBE HIS NAME MEANS "
FETISH WHICH CANNOT BE GRASPED " HE WAS REARED BY CATHOLIC MISSIONARIES
AND IN 1946 ORDAINED A PRIEST . LATER, IN DEFIANCE OF ORDERS FROM HIS
SUPERIOR, YOULOU RAN FOR THE FRENCH ASSEMBLY (HE LOST) AND WAS
SUSPENDED BY THE CHURCH, IS STILL FORBIDDEN TO SAY MASS . BECAUSE OF
HIS SUSPENSION, HE WAS ACCLAIMED BY HIS COUNTRYMEN AS A VICTIM OF
DISCRIMINATION AND ELECTED MAYOR OF BRAZZAVILLE IN 1956 . EXPLOITING
CONGOLESE SUPERSTITIONS, HE SOON HAD MANY VOTERS CONVINCED THAT HIS
PERSONAL FETISH, A SMALL YELLOW CROCODILE, HAD " 

THE VOICES ARE LURING ME, URGING ME FROM THE MIDNIGHT MOON AND THE
SILENCE OF MY DESK TO WALK ON WAVE CRESTS ACROSS A SEA . MAYBE I'M A
MEDICINE MAN HEARING TALKING SAPS, SEEING BEHIND TREES ; BUT WHO'S
LOST HIS POWERS OF INVOCATION . BUT THE VOICES AND THE TREES ARE ONE
NAME SPELLING AND ONE FIGURE SILENCE-ETCHED ACROSS THE MOONFACE IS
WALKING, STEPPING OVER CONTINENTS AND SEAS . AND I RAISED MY HAND MY
TREMBLING HAND, GRIPPING MY HEART AS HANDKERCHIEF AND WAVED AND WAVED
AND WAVED BUT SHE TURNED HER EYES AWAY . GABRIEL OKARA
(NIGERIA) TO NEW YORK FOR
JAZZ ORCHESTRA : TRUMPET SOLO
POETRY AND EVOLVING WHAT ANTHOLOGIST HUGHES HOPEFULLY DESCRIBES AS A
LITERATURE THAT " WALKS WITH GRACE AND ALREADY IS BEGINNING TO ACHIEVE
AN INDIVIDUALITY QUITE ITS OWN . " MOST OF THE BLACK NEW WAVE POETS ARE
CONCERNED WITH NEGRITUDE, A FRENCH WORD FOR THE ESSENCE OF BLACKNESS
AND, BY EXTENSION, FOR A WORLD IN WHICH DESPAIR IS WHITE, WHILE GOD AND
INNOCENCE ARE BLACK . MANY W

BRICKS . POLICE WITH DRAWN CLUBS RUSHED THEM, JOINED BY VOLUNTEER
WHITES . A WHITE FARMER EMPTIED HIS PISTOL INTO THE AIR . ON THE FIELD,
SOUTH AFRICAN FORWARD DICK PUTTER WAS CUT IN THE MOUTH BY ONE MISSILE,
AND AS A BOTTLE SPUN TOWARD REFEREE PIET MYBURGH, A HUSKY AUSTRALIAN
SAVED HIM WITH A FLYING TACKLE . AFTER PLAY RESUMED, SOUTH AFRICA WON,
22 TO 6 . THE SCORE AMONG THE SPECTATORS : SIX WHITES AND 20 NONWHITES
HOSPITALIZED, TWO NONWHITES ARRESTED, 40 CARS IN THE PARKING LOT
DAMAGED FROM ROCKS RAINED ONTO THEM FROM THE BLACK STANDS . POSSIBLY
HINTING THAT NONWHITES MAY BE BARRED FROM ALL STADIUMS, FOREIGN
MINISTER ERIC LOUW THUNDERED THAT " ACTIVE MEASURES ARE NECESSARY TO
PREVENT THIS SORT OF THING, " WHICH, HE COMPLAINED, " IS GIVING SOUTH
AFRICA A BAD NAME OVERSEAS . " AS FOR PORT ELIZABETH, THE AUTHORITIES
THERE DECIDED THAT THE WISE COURSE WOULD BE TO SELL NO MORE BEVERAGES
IN BOTTLES AND TO DOUBLE THE ADMISSION PRICE FOR NONWHITES AT FUTURE
GAMES 

BANYAN TREE, WHICH PROVERBIALLY KILLS EVERY OTHER ORGANISM THAT
GROWS IN ITS SHADE.  IN THE WAKE OF THREE PARLIAMENTARY BY-ELECTION
DEFEATS LAST SPRING, NEHRU ANNOUNCED THAT HE WOULD ASK A DOZEN TOP
CABINET AND STATE MINISTERS TO RESIGN FROM THE GOVERNMENT IN ORDER
TO LET THEM GO TO WORK REVITALIZING THE PARTY ORGANIZATION AND
REBUILDING ITS STRENGTH AMONG THE VOTERS.  BUT THE KAMARAJ PLAN
WAS REALLY USED BY THE PRIME MINISTER AS A RUSE TO FLUSH OUT ALL THE
TOP CONTENDERS FOR HIS OWN JOB.  THERE IS EVEN WIDESPREAD
SUSPICION THAT NEHRU FORCED THE RESIGNATIONS OF HIS ABLEST MINISTERS
IN ORDER TO CLEAR THE WAY FOR HIS DAUGHTER, IMPERIOUS INDIRA GANDHI,
45, WIDOW OF A BACKBENCH CONGRESS POLITICIAN (NO KIN TO THE MAHATMA),
WHO HAS LONG BEEN THE PRIME MINISTER'S CLOSEST CONFIDANTE (HE
CALLS HER INDU, OR MOON), OFFICIAL HOSTESS AND POLITICAL
TROUBLESHOOTER.
RULER OF THE WORLD.  BUT DISCOUNTING INDIRA AS A REAL POLITICAL
CONTENDER, THE CHOICE OF MOST PARTY MEMBERS

OUTSIDE OF HIS FAMILY, MORO'S ONLY NONPOLITICAL PASSION IS FLOWERS,
WHICH HE RAISES AT A SMALL COUNTRY HOUSE ABOUT 30 MILES FROOM ROME . "
THERE, " HE SAYS, " I AM PERFECTLY AT EASE BECAUSE I AM WITH MY
CHILDREN AND MY ROSES . " VICE PREMIER PIETRO NENNI, 72, IS AS
IMPULSIVE AS MORO IS DELIBERATE . WHILE MORO ACHIEVES AGREEMENT THROUGH
PATIENCE, NENNI OFTEN GETS HIS WAY BY SHEER CHARM AND ELOQUENCE . IN
MORE THAN HALF A CENTURY, HIS CAREER HAS FOLLOWED INNUMERABLE TWISTS
AND TURNS . EVEN AS A YOUTH IN ROMAGNA, IN ITALY'S RUGGED NORTH, NENNI
WAS AN ADEPT AT POLITICAL EXPEDIENCY . AFTER HIS PEASANT PARENTS DIED,
HE WAS PLACED IN AN ORPHANAGE BY AN ARISTOCRATIC FAMILY . EVERY SUNDAY
NENNI RECITED HIS CATECHISM BEFORE THE COUNTESS AND IF HE DID WELL
RECEIVED A SILVER COIN . " GENEROUS BUT HUMILIATING, " HE RECALLED . HE
BECAME A SOCIALIST AND BEFRIENDED A FELLOW ROMAGNAN NAMED BENITO
MUSSOLINI . IL DUCE'S FASCIST THUGS LATER ALMOST SHOT NENNI WHEN THEY
RAIDED AN

GOVERNMENT HOPE TO ATTRACT ALL THE SKILLED CITIZENS IT NEEDS .
MOREOVER, THEY ARGUE, AUSTRALIA CAN NEVER REALIZE ITS POTENTIAL AS A
LEADER OF SOUTHEAST ASIA SO LONG AS ITS NEIGHBORS ARE CONVINCED THAT
AUSTRALIANS ARE WHITE SUPREMACISTS . PRIME MINISTER SIR ROBERT MENZIES
IN FACT ORDERED MORE LIBERAL INTERPRETATION OF IMMIGRATION POLICY, BUT
HE INSISTED DURING THE RECENT ELECTION CAMPAIGN THAT HE WOULD NEVER
PERMIT ANY BASIC REFORM IN THE LAW . TO DO SO, SAID HE, " WOULD CREATE
IN AUSTRALIA THE KIND OF DREADFUL PROBLEMS THEY NOW HAVE IN OTHER
COUNTRIES . /
*TEXT 555
THE HOLE IN THE WALL THE SWEETEST CHRISTMAS
MUSIC BERLINERS HAVE HEARD IN MORE THAN TWO YEARS HAD NOTHING TO DO
WITH BACH OR HANDEL . IT WAS THE UGLY STUTTER OF JACKHAMMERS TEARING
GATES IN THE BERLIN WALL, THE WHINE OF CRANES REMOVING ZIGZAG BARRIERS
FROM HEAVILY GUARDED CROSSING POINTS . THEN, LATE LAST WEEK, THE
CANDYSTRIPE CUSTOMS POLES WENT UP, AND THOUSANDS OF GRINNING,
GIFT-LADEN WEST B

In [8]:
def remove_punctuation(pattern, line):
    for pat in pattern:
        return("".join(re.findall(pat, line)))

In [9]:

def remove_number_from_beginning(line):
    pattern_begin = re.compile("^[0-9]+,")
    return(re.sub(pattern_begin, "", line))

In [10]:
def remove_stop_words(line):
    stop_words = set(stopwords.words('english')) 
    word_tokens = word_tokenize(line)
    filtered_line = [w for w in word_tokens if not w in stop_words]
    filtered_line = " ".join(filtered_line)
    return filtered_line

In [11]:
def process_document(document):
    
    pattern_remove_punc = ['[^!.,?]+']
    document = remove_punctuation(pattern_remove_punc, document)
    document = remove_stop_words(document)
    document = document.lower()
    return document

In [12]:
def process_query(query):
    
    query = remove_number_from_beginning(query)
    pattern_remove_punc = ['[^!.,?]+']
    query = remove_punctuation(pattern_remove_punc, query)
    query = remove_stop_words(query)
    query = query.lower()
    
    return query

In [13]:
documents = []
pattern_head = re.compile("^\*TEXT\s[0-9]+")

with open('documents.csv', 'r') as my_input_file:
    i = -1
    for line in my_input_file:
        if (re.search(pattern_head, line)):
            if (i != -1):
                documents[i] = process_document(documents[i])
            i += 1
        else:
            filtered_line = process_document(line)
            try:
                documents[i] += " " + filtered_line
            except IndexError:
                documents.append(" " + filtered_line)

In [14]:
len(documents)

423

In [15]:
queries = []
pattern = re.compile(r"^[0-9]+,")

with open('queries.csv', 'r') as queries_file:
    i = -1
    for line in queries_file:
        if (re.search(pattern, line)):
            i += 1
            filtered_line = process_query(line)
            try:
                queries[i] += " " + process_query(filtered_line)
            except IndexError:
                queries.append(" " + process_query(filtered_line))

In [16]:
len(queries)

83

In [17]:
column_names_train = ['QueryId', 'DocumentId', 'Relevance']
list_of_rows_train = []
for i in range(len(queries)):
    for j in range(len(documents)):
        if (i in set(relevance_train.iloc[:, 0])):
            relevance = 1
        else:
            relevance = 0
        new_row = {'QueryId': i+1, 'DocumentId': j+1, 'Relevance': relevance}
        list_of_rows_train.append(new_row)
        
train = pd.DataFrame(list_of_rows_train)

In [18]:
train.shape

(35109, 3)

In [19]:
train.head()

Unnamed: 0,DocumentId,QueryId,Relevance
0,1,1,0
1,2,1,0
2,3,1,0
3,4,1,0
4,5,1,0


In [20]:
column_names_test = ['QueryId', 'DocumentId']
list_of_rows_test = []

for i in range(len(relevance_test)):
    for j in range(len(documents)):
        new_row = {'QueryId': relevance_test['QueryId'][i], 'DocumentId': j+1}
        list_of_rows_test.append(new_row)

test = pd.DataFrame(list_of_rows_test)

In [21]:
test.shape

(17766, 2)

In [22]:
test.head()

Unnamed: 0,DocumentId,QueryId
0,1,2
1,2,2
2,3,2
3,4,2
4,5,2


In [23]:
def count_trigrams(string):
    string = '^^' + string + '$$'
    trigrams = set()
    trigrams_count = 0
    
    for i in range(len(string) - 2):
        trigrams.add(string[i:i+3])
        trigrams_count += 1
        
    return trigrams, trigrams_count

In [24]:
def word_probability(word, documents_list):
    
    count = 0
    
    for document in documents_list:
        document_words = set(word_tokenize(document))
        if (word in document_words):
            count += 1
    
    return count / len(documents)

In [25]:
def get_trigrams_factors_and_TFIDF(document, query):
    document_trigrams, document_trigrams_count = count_trigrams(document)
    query_trigrams, query_trigrams_count = count_trigrams(query)

    factors = []
    
    factors.append(float(len(document_trigrams.intersection(query_trigrams))))
#     factors.append(0. if document_trigrams_count == 0. else 0.1 + factors[0] / document_trigrams_count) 
    factors.append(0. if query_trigrams_count == 0. else 0.1 + factors[0] / query_trigrams_count)
    factors.append(TFIDF(document, query))
    
    return factors

In [26]:
def get_probabilities_of_words(documents_list):
    count_words = {}
    for document in documents_list:
        document_words = set(word_tokenize(document))
        for word in document_words:
            if (count_words.get(word) is None):
                count_words[word] = 1
            else:
                count_words[word] += 1
    count_words
    dict_probability = {}
    number_documents = len(documents_list)
    for k, v in count_words.items():
        dict_probability[k] = v / number_documents
    return dict_probability

In [27]:
words_probabilities_dict = get_probabilities_of_words(documents)

def TFIDF(document, query):

    query_words = word_tokenize(query)
    dictionary = Counter(word_tokenize(document))
    dictionary.setdefault(0.0)
    components = []
    for word in query_words:
        try:
            number = dictionary.get(word, 0.0) * np.log(1 / words_probabilities_dict.get(word, 0.0))
            components.append(number)
        except ZeroDivisionError:
            _ = True
    components = np.array(components)
    return np.sum(components)

In [28]:
def calc_factors(row):
    document = None
    query = None
    try:
        document = documents[row['DocumentId'] - 1]
        query = queries[row['QueryId'] - 1]
        return pd.Series(get_trigrams_factors_and_TFIDF(document, query))
    except IndexError:
        print('no such document with id: {} or query with id: {} '.format(row['DocumentId'], row['QueryId']))

In [29]:
train.iloc[35080]

DocumentId    395
QueryId        83
Relevance       0
Name: 35080, dtype: int64

In [30]:
train.iloc[2]

DocumentId    3
QueryId       1
Relevance     0
Name: 2, dtype: int64

In [31]:
TFIDF(documents[2], queries[0])

7.840154923683395

In [32]:
train_factors = train.apply(calc_factors, axis=1, reduce=False)
train_factors.reset_index(drop=True)

Unnamed: 0,0,1,2
0,43.0,0.673333,8.788898
1,21.0,0.380000,0.000000
2,36.0,0.580000,7.840155
3,22.0,0.393333,0.000000
4,47.0,0.726667,2.197225
5,37.0,0.593333,0.000000
6,43.0,0.673333,2.581636
7,20.0,0.366667,0.000000
8,35.0,0.566667,0.000000
9,23.0,0.406667,0.000000


In [33]:
test_factors = test.apply(calc_factors, axis=1, reduce=False)
test_factors.reset_index(drop=True)

Unnamed: 0,0,1,2
0,70.0,0.754206,5.593517
1,39.0,0.464486,0.119623
2,54.0,0.604673,0.023925
3,36.0,0.436449,0.071774
4,59.0,0.651402,1.410663
5,63.0,0.688785,7.892326
6,62.0,0.679439,0.311020
7,30.0,0.380374,0.023925
8,58.0,0.642056,0.239246
9,54.0,0.604673,2.697845


In [34]:
train_factors.shape, test_factors.shape

((35109, 3), (17766, 3))

In [55]:
clf = XGBClassifier(n_estimators=50)
clf.fit(train_factors.values, train[['Relevance']].values)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=50,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [56]:
clf.predict_proba(test_factors.values)

array([[0.54358125, 0.45641875],
       [0.5516395 , 0.4483605 ],
       [0.59480464, 0.4051954 ],
       ...,
       [0.5388657 , 0.46113428],
       [0.48147297, 0.51852703],
       [0.41196626, 0.58803374]], dtype=float32)

In [57]:
clf.classes_

array([0, 1])

In [58]:
train['Relevance'].values

array([0, 0, 0, ..., 0, 0, 0])

In [63]:
test['Relevance'] = clf.predict_proba(test_factors.values)[:, 0]

In [64]:
test = test.sort_values(['QueryId', 'Relevance'], ascending=[True, False])
test.head()

Unnamed: 0,DocumentId,QueryId,Relevance
12,13,2,0.745582
9,10,2,0.717859
164,165,2,0.681059
130,131,2,0.653199
338,339,2,0.619152


In [65]:
test[test['QueryId'] == 2]

Unnamed: 0,DocumentId,QueryId,Relevance
12,13,2,0.745582
9,10,2,0.717859
164,165,2,0.681059
130,131,2,0.653199
338,339,2,0.619152
206,207,2,0.619152
145,146,2,0.606883
159,160,2,0.606883
99,100,2,0.606883
278,279,2,0.605518


In [62]:
test[['QueryId', 'DocumentId']].to_csv('baseline.csv', index=None)

In [66]:

list_of_rows_result = []

for i in range(len(relevance_test)):
    for j in range(len(documents)):
        new_row = {'QueryId': relevance_test['QueryId'][i], 'DocumentId': j+1, 'TFIDF': TFIDF(documents[j], queries[relevance_test['QueryId'][i] - 1])}
        list_of_rows_result.append(new_row)

result = pd.DataFrame(list_of_rows_result)

In [69]:
result = result.sort_values(['QueryId', 'TFIDF'], ascending=[True, False])
result

Unnamed: 0,DocumentId,QueryId,TFIDF
325,326,2,182.270350
369,370,2,165.101695
333,334,2,159.443082
348,349,2,148.374944
210,211,2,129.036024
358,359,2,85.413075
307,308,2,83.129490
303,304,2,72.211226
406,407,2,58.404156
394,395,2,54.486863


In [68]:
result[['QueryId', 'DocumentId']].to_csv('baseline.csv', index=None)