# Data Formats (1)

Materials:
* S.V. Makrushin, "Lecture 4: Data Formats"
* https://docs.python.org/3/library/json.html
* https://docs.python.org/3/library/pickle.html
* https://www.crummy.com/software/BeautifulSoup/bs4/doc.ru/bs4ru.html
* Wes McKinney, Python for Data Analysis

## Problems to work through together

1. Print all email addresses contained in the address book `addres-book.json`

In [190]:
import json
import pickle
from collections import defaultdict
from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup as bs

In [191]:
def read_json(path: str):
    with open(path, encoding='utf-8') as f:
        return json.load(f)


def write_json(path: str, data: dict):
    with open(path, mode='w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
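
The effect of the `ensure_ascii` and `indent` arguments used in `write_json` can be seen on a small in-memory example. This is only a sketch: the record below is hypothetical and is not taken from the lab data.

```python
import json

# Hypothetical sample record with a non-ASCII name (not from the lab files).
record = {"name": "Фаина", "phones": ["232-19-55"]}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)
# ensure_ascii=False keeps the characters readable in the output.
readable = json.dumps(record, ensure_ascii=False)
# indent=4 pretty-prints the structure over several lines.
pretty = json.dumps(record, ensure_ascii=False, indent=4)

print(escaped)
print(readable)
print(pretty)
```

All three strings decode back to the same Python object; only the on-disk representation (and its size) differs.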

In [192]:
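def read_xml(path: str) -> bs:
    # 'lxml' selects the lenient HTML parser, so the parsed tree is wrapped
    # in <html><body>, as the output of `soup` further down shows.
    with open(path, encoding='utf-8') as f:
        return bs(f.read(), 'lxml')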

In [193]:
def read_pickle(path: str):
    with open(path, mode='rb') as f:
        return pickle.load(f)


def write_pickle(path: str, data: dict):
    with open(path, mode='wb') as f:
        pickle.dump(data, f)

In [194]:
address_book = read_json('data/addressBook.json')
address_book

[{'name': 'Faina Lee',
  'email': 'faina@mail.ru',
  'birthday': '22.08.1994',
  'phones': [{'phone': '232-19-55'}, {'phone': '+7 (916) 232-19-55'}]},
 {'name': 'Robert Lee',
  'email': 'robert@mail.ru',
  'birthday': '22.08.1994',
  'phones': [{'phone': '111-19-55'}, {'phone': '+7 (916) 445-19-55'}]}]

In [195]:
emails = [address.get('email') for address in address_book]
emails

['faina@mail.ru', 'robert@mail.ru']
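
Note that `dict.get` returns `None` when a key is absent, so an entry without an email would put `None` into the list. A minimal sketch of both behaviours, assuming a hypothetical entry that has no `email` key:

```python
# Hypothetical address book: the second entry is invented and has no email.
address_book = [
    {"name": "Faina Lee", "email": "faina@mail.ru"},
    {"name": "No Email"},
]

# .get() keeps a None placeholder for the entry without an email.
emails_with_gaps = [entry.get("email") for entry in address_book]

# Filtering on key presence keeps only the addresses that actually exist.
emails = [entry["email"] for entry in address_book if "email" in entry]

print(emails_with_gaps)
print(emails)
```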

2. Print the phone numbers contained in the address book `addres-book.json`

In [196]:
phones = []
for address in address_book:
    phones.extend([phone['phone'] for phone in address.get('phones')])

phones

['232-19-55', '+7 (916) 232-19-55', '111-19-55', '+7 (916) 445-19-55']
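
The same flattening can be written as a single nested comprehension. The sketch below also passes a default to `.get`, so that an entry without a `phones` key (the third, hypothetical entry) does not raise a `TypeError`:

```python
# The first two entries mirror the address book above; the third is hypothetical.
address_book = [
    {"name": "Faina Lee",
     "phones": [{"phone": "232-19-55"}, {"phone": "+7 (916) 232-19-55"}]},
    {"name": "Robert Lee",
     "phones": [{"phone": "111-19-55"}, {"phone": "+7 (916) 445-19-55"}]},
    {"name": "No Phones"},
]

# Outer loop over entries, inner loop over each entry's phone records.
phones = [
    phone["phone"]
    for entry in address_book
    for phone in entry.get("phones", [])
]

print(phones)
```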

3. From the data in the file `addres-book-q.xml`, build a list of dictionaries with each person's phone numbers.

In [197]:
soup = read_xml('data/address_book.xml')
soup

<?xml version="1.0" encoding="UTF-8" ?><html><body><address_book>
<country name="algeria">
<address id="1">
<gender>m</gender>
<name>Aicha Barki</name>
<email>aiqraa.asso@caramail.com</email>
<position>Presidente</position>
<company>Association Algerienne d'Alphabetisation Iqraa</company>
<phones>
<phone type="work">+ (213) 6150 4015</phone>
<phone type="personal">+ (213) 2173 5247</phone>
</phones>
</address>
</country>
<country name="angola">
<address id="2">
<gender>m</gender>
<name>Francisco Domingos</name>
<email>frandomingos@hotmail.com</email>
<position>Directeur General</position>
<company>Institut National de Education des Adultes</company>
<phones>
<phone type="work">+ (244-2) 325 023</phone>
<phone type="personal">+ (244-2) 325 023</phone>
</phones>
</address>
<address id="3">
<gender>f</gender>
<name>Maria Luisa</name>
<email>luisagrilo@ebonet.net</email>
<position>Directrice Nationale</position>
<company>Institut National de Education des Adultes</company>
<phones>
<phone 

In [198]:
phone_book = []
for address in soup.find_all('address'):
    name = address.find('name').string
    phones = [phone.string for phone in address.find_all('phone')]
    phone_book.append({
        'name': name,
        'phones': phones
    })

phone_book

[{'name': 'Aicha Barki', 'phones': ['+ (213) 6150 4015', '+ (213) 2173 5247']},
 {'name': 'Francisco Domingos',
  'phones': ['+ (244-2) 325 023', '+ (244-2) 325 023']},
 {'name': 'Maria Luisa', 'phones': ['+ (244) 4232 2836']},
 {'name': 'Abraao Chanda',
  'phones': ['+ (244-2) 325 023', '+ (244-2) 325 023']},
 {'name': 'Beatriz Busaniche', 'phones': ['+ (54-11) 4784 1159']},
 {'name': 'Francesca Beddie',
  'phones': ['+ (61-2) 6274 9500', '+ (61-2) 6274 9513']},
 {'name': 'Graham John Smith', 'phones': ['+ (61-3) 9807 4702']}]
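
For comparison, the same list can be built with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup. This is a sketch on an inline fragment that copies the first address from the output above; it is not the approach used in this notebook.

```python
import xml.etree.ElementTree as ET

# Inline fragment mirroring the structure of the address book shown above.
xml_text = """<address_book>
  <country name="algeria">
    <address id="1">
      <name>Aicha Barki</name>
      <phones>
        <phone type="work">+ (213) 6150 4015</phone>
        <phone type="personal">+ (213) 2173 5247</phone>
      </phones>
    </address>
  </country>
</address_book>"""

root = ET.fromstring(xml_text)

# iter() walks all descendants, so nesting under <country> does not matter.
phone_book = [
    {
        "name": address.findtext("name"),
        "phones": [phone.text for phone in address.iter("phone")],
    }
    for address in root.iter("address")
]

print(phone_book)
```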

## Lab 4

### JSON

1.1 Read the file `contributors_sample.json`. Using the `json` module, convert its contents into the corresponding Python objects. Print the information about the first 3 users.

In [199]:
contributors = read_json('data/contributorsSample.json')
contributors[:3]

[{'username': 'uhebert',
  'name': 'Lindsey Nguyen',
  'sex': 'F',
  'address': '01261 Cameron Spring\nTaylorfurt, AK 97791',
  'mail': 'jsalazar@gmail.com',
  'jobs': ['Energy engineer',
   'Engineer, site',
   'Environmental health practitioner',
   'Biomedical scientist',
   'Jewellery designer'],
  'id': 35193},
 {'username': 'vickitaylor',
  'name': 'Cheryl Lewis',
  'sex': 'F',
  'address': '66992 Welch Brooks\nMarshallshire, ID 56004',
  'mail': 'bhudson@gmail.com',
  'jobs': ['Music therapist',
   'Volunteer coordinator',
   'Designer, interior/spatial'],
  'id': 91970},
 {'username': 'sheilaadams',
  'name': 'Julia Allen',
  'sex': 'F',
  'address': 'Unit 1632 Box 2971\nDPO AE 23297',
  'mail': 'darren44@yahoo.com',
  'jobs': ['Management consultant',
   'Engineer, structural',
   'Lecturer, higher education',
   'Theatre manager',
   'Designer, textile'],
  'id': 1848091}]

1.2 Print the unique mail domains found in people's email addresses

In [200]:
mail_domains = set()
for contributor in contributors:
    mail_domains.add(contributor['mail'].split('@')[-1])

mail_domains

{'gmail.com', 'hotmail.com', 'yahoo.com'}
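
A set comprehension gives the same result in one expression. The sketch below uses three addresses from the sample output above and additionally lowercases the domain, which is an assumption about how differently cased domains should be treated:

```python
# Three mail addresses taken from the first contributors shown above.
contributors = [
    {"mail": "jsalazar@gmail.com"},
    {"mail": "bhudson@gmail.com"},
    {"mail": "darren44@yahoo.com"},
]

# rpartition("@") splits on the last "@"; element [2] is the domain part.
mail_domains = {
    contributor["mail"].rpartition("@")[2].lower()
    for contributor in contributors
}

print(sorted(mail_domains))
```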

1.3 Write a function that looks up a person by `username` and prints their information. If no user with the given `username` exists, raise a `ValueError`

In [201]:
# assumption: username is unique
indexed_data = {contributor['username']: contributor for contributor in contributors}

In [202]:
def find_by_username(username: str) -> dict:
    contributor = indexed_data.get(username)
    if contributor is None:
        raise ValueError(f'Contributor with {username=} not found.')

    return contributor

In [203]:
find_by_username('uhebert')

{'username': 'uhebert',
 'name': 'Lindsey Nguyen',
 'sex': 'F',
 'address': '01261 Cameron Spring\nTaylorfurt, AK 97791',
 'mail': 'jsalazar@gmail.com',
 'jobs': ['Energy engineer',
  'Engineer, site',
  'Environmental health practitioner',
  'Biomedical scientist',
  'Jewellery designer'],
 'id': 35193}

In [204]:
try:
    find_by_username('aabbcc')
except ValueError as e:
    print(e)

Contributor with username='aabbcc' not found.
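
The lookup relies on a module-level index and on the assumption that usernames are unique. One possible variant, sketched here with hypothetical helper names (`build_index` and a two-argument `find_by_username`), passes the index explicitly and checks the uniqueness assumption while building it:

```python
def build_index(contributors: list) -> dict:
    """Map username -> record, failing loudly on duplicate usernames."""
    index = {}
    for contributor in contributors:
        username = contributor["username"]
        if username in index:
            raise ValueError(f"Duplicate username: {username!r}")
        index[username] = contributor
    return index


def find_by_username(index: dict, username: str) -> dict:
    """Return the record for `username` or raise ValueError."""
    try:
        return index[username]
    except KeyError:
        raise ValueError(f"Contributor with {username=} not found.") from None


# Two shortened records based on the sample output above.
contributors = [
    {"username": "uhebert", "name": "Lindsey Nguyen"},
    {"username": "vickitaylor", "name": "Cheryl Lewis"},
]
index = build_index(contributors)
print(find_by_username(index, "uhebert"))
```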


1.4 Count how many men and women are present in this dataset.

In [205]:
def count_by_sex(data):
    counter = defaultdict(int)  # missing keys start at 0
    for contributor in data:
        counter[contributor['sex'].upper()] += 1

    return {'M': counter['M'], 'F': counter['F']}

In [206]:
count_by_sex(contributors)

{'M': 2064, 'F': 2136}
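
The same count can be expressed with `collections.Counter`, which also returns 0 for keys that were never seen. A minimal sketch on hypothetical records:

```python
from collections import Counter

# Hypothetical records; only the 'sex' field matters for the count.
contributors = [{"sex": "F"}, {"sex": "f"}, {"sex": "M"}]

# Normalise case before counting, as count_by_sex does above.
counts = Counter(contributor["sex"].upper() for contributor in contributors)

print({"M": counts["M"], "F": counts["F"]})
```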

1.5 Create a `pd.DataFrame` named `contributors` with the columns `id`, `username`, and `sex`.

In [207]:
contributors_df = pd.DataFrame([
    {
        'id': contributor['id'],
        'username': contributor['username'],
        'sex': contributor['sex']
    }
    for contributor in contributors
])

contributors_df

| | id | username | sex |
|---|---|---|---|
| 0 | 35193 | uhebert | F |
| 1 | 91970 | vickitaylor | F |
| 2 | 1848091 | sheilaadams | F |
| 3 | 50969 | nicole82 | F |
| 4 | 676820 | jean67 | M |
| ... | ... | ... | ... |
| 4195 | 423555 | stevenspencer | F |
| 4196 | 35251 | rwilliams | M |
| 4197 | 135887 | lmartinez | F |
| 4198 | 212714 | brendahill | M |
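
When the records are already a list of dictionaries, the `columns` argument of `pd.DataFrame` can select the needed keys directly. A sketch on two shortened records based on the sample output above:

```python
import pandas as pd

# Two shortened records; extra keys such as 'name' are simply dropped.
contributors = [
    {"username": "uhebert", "name": "Lindsey Nguyen", "sex": "F", "id": 35193},
    {"username": "vickitaylor", "name": "Cheryl Lewis", "sex": "F", "id": 91970},
]

# `columns` both selects and orders the keys taken from each record.
contributors_df = pd.DataFrame(contributors, columns=["id", "username", "sex"])

print(contributors_df)
```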


1.6 Load the data from the file `recipes_sample.csv` (__Lab 2__) into a table `recipes`. Merge `recipes` with the `contributors` table, keeping the rows for which information about the person is missing from the JSON file. For how many people is the information missing?

In [208]:
recipes_df = pd.read_csv('data/recipes_sample.csv', sep=',', parse_dates=['submitted'])
recipes_df

| | name | id | minutes | contributor_id | submitted | n_steps | description | n_ingredients |
|---|---|---|---|---|---|---|---|---|
| 0 | george s at the cove black bean soup | 44123 | 90 | 35193 | 2002-10-25 | | an original recipe created by chef scott meska... | 18.0 |
| 1 | healthy for them yogurt popsicles | 67664 | 10 | 91970 | 2003-07-26 | | my children and their friends ask for my homem... | |
| 2 | i can t believe it s spinach | 38798 | 30 | 1533 | 2002-08-29 | | these were so go, it surprised even me. | 8.0 |
| 3 | italian gut busters | 35173 | 45 | 22724 | 2002-07-27 | | my sister-in-law made these for us at a family... | |
| 4 | love is in the air beef fondue sauces | 84797 | 25 | 4470 | 2004-02-23 | 4.0 | i think a fondue is a very romantic casual din... | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29995 | zurie s holey rustic olive and cheddar bread | 267661 | 80 | 200862 | 2007-11-25 | 16.0 | this is based on a french recipe but i changed... | 10.0 |
| 29996 | zwetschgenkuchen bavarian plum cake | 386977 | 240 | 177443 | 2009-08-24 | | this is a traditional fresh plum cake, thought... | 11.0 |
| 29997 | zwiebelkuchen southwest german onion cake | 103312 | 75 | 161745 | 2004-11-03 | | this is a traditional late summer early fall s... | |
| 29998 | zydeco soup | 486161 | 60 | 227978 | 2012-08-29 | | this is a delicious soup that i originally fou... | |


In [209]:
recipes_contributors_df = pd.merge(
    recipes_df.drop_duplicates('contributor_id'),
    contributors_df,
    how='left',
    left_on='contributor_id',
    right_on='id'
)

recipes_contributors_df

| | name | id_x | minutes | contributor_id | submitted | n_steps | description | n_ingredients | id_y | username | sex |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | george s at the cove black bean soup | 44123 | 90 | 35193 | 2002-10-25 | | an original recipe created by chef scott meska... | 18.0 | 35193.0 | uhebert | F |
| 1 | healthy for them yogurt popsicles | 67664 | 10 | 91970 | 2003-07-26 | | my children and their friends ask for my homem... | | 91970.0 | vickitaylor | F |
| 2 | i can t believe it s spinach | 38798 | 30 | 1533 | 2002-08-29 | | these were so go, it surprised even me. | 8.0 | | | |
| 3 | italian gut busters | 35173 | 45 | 22724 | 2002-07-27 | | my sister-in-law made these for us at a family... | | | | |
| 4 | love is in the air beef fondue sauces | 84797 | 25 | 4470 | 2004-02-23 | 4.0 | i think a fondue is a very romantic casual din... | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8399 | zucchini strips | 279769 | 30 | 522304 | 2008-01-18 | 8.0 | spicy, salty, crispy zucchini strips! | 4.0 | | | |
| 8400 | zucchini with bacon corn peppers | 326105 | 45 | 896136 | 2008-09-19 | 8.0 | this is a very colorful addition to any meal. ... | 7.0 | | | |
| 8401 | zucchini with bell pepper and tomato | 363362 | 13 | 344321 | 2009-03-29 | 19.0 | the weather has been turning warmer and i have... | 8.0 | 344321.0 | mistyray | F |
| 8402 | zucchini with serrano ham | 162411 | 15 | 152500 | 2006-03-31 | 6.0 | this dish is from tim malzer, a german chef wh... | 5.0 | | | |


In [210]:
recipes_contributors_df['id_y'].isna().sum()

4204
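
An alternative way to count the unmatched people is to merge without dropping duplicates and let `merge` mark the origin of each row through `indicator=True`. This is a sketch on a tiny frame; the recipe id `11111` and the suffix names are made up for illustration.

```python
import pandas as pd

# Small stand-ins for the two tables; contributor 1533 has two recipes.
recipes = pd.DataFrame({
    "id": [44123, 67664, 38798, 11111],
    "contributor_id": [35193, 91970, 1533, 1533],
})
contributors = pd.DataFrame({
    "id": [35193, 91970],
    "username": ["uhebert", "vickitaylor"],
})

# how="left" keeps every recipe; the "_merge" column says whether a match was found.
merged = recipes.merge(
    contributors,
    how="left",
    left_on="contributor_id",
    right_on="id",
    suffixes=("_recipe", "_contributor"),
    indicator=True,
)

# Count distinct people (not recipes) that have no contributor record.
missing_people = merged.loc[
    merged["_merge"] == "left_only", "contributor_id"
].nunique()

print(missing_people)
```

Counting with `nunique()` on `contributor_id` answers the question about people even when one person submitted several recipes.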

### pickle

2.1 Based on the file `contributors_sample.json`, build a dictionary of the following form:
```
{
    job title: [list of usernames of people who held that job]
}
```

In [211]:
job_people = defaultdict(list)
for contributor in contributors:
    for job in contributor['jobs']:
        job_people[job.strip().casefold()].append(contributor['username'])

print({k: v for k, v in list(job_people.items())[:3]})

{'energy engineer': ['uhebert', 'annmoore', 'garysilva', 'martinezashley', 'sextonsheila', 'pjames', 'smithjonathan', 'wardjames', 'cwheeler', 'ucarlson', 'robert71', 'johnsontheresa', 'amanda41', 'stacey47', 'timothynelson', 'timothynelson', 'rogersmichael', 'melissa94', 'wmcdaniel', 'charles74', 'smithjennifer', 'clintonjones'], 'engineer, site': ['uhebert', 'nancy12', 'andrea03', 'catherineross', 'wesley32', 'natalieross', 'rossdoris', 'christophersmith', 'dbooker', 'ericarobertson', 'trantricia', 'tpugh', 'jasonvelez', 'samantha36', 'brandidaniels', 'tenglish', 'reyesbrett', 'austin18', 'vjohnson', 'zmejia', 'daniel04', 'cynthia20', 'morgan15', 'avaldez', 'jessica92', 'laurieholloway', 'baileyvictoria'], 'environmental health practitioner': ['uhebert', 'jonathanchristian', 'xjohnson', 'dsmith', 'james01', 'nancytaylor', 'ztaylor', 'andrewwoods', 'susan54', 'fmaldonado', 'james74', 'bakerjacob', 'stephanie81', 'whitejoseph', 'qolson', 'hknox', 'gonzalesdaniel', 'tranronald', 'jesseg

2.2 Save the result to the file `job_people.pickle` and to the file `job_people.json` using the pickle and JSON formats respectively. Compare the sizes of the resulting files. When saving to JSON, pass the `indent` argument.

In [212]:
pickle_path = 'output/job_people.pickle'
write_pickle(pickle_path, job_people)

In [213]:
json_path = 'output/job_people.json'
write_json(json_path, job_people)

In [214]:
print(f'pickle: {Path(pickle_path).stat().st_size} bytes')
print(f'json: {Path(json_path).stat().st_size} bytes')

pickle: 132393 bytes
json: 407711 bytes
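
Part of the JSON size comes from the whitespace added by `indent`. The sketch below compares serialized sizes in memory on a small hypothetical dictionary, so the absolute numbers say nothing about the real files; it only illustrates that compact JSON is smaller than indented JSON.

```python
import json
import pickle

# Small hypothetical stand-in for job_people.
job_people = {
    "energy engineer": ["uhebert", "annmoore", "garysilva"],
    "engineer, site": ["uhebert", "nancy12", "andrea03"],
}

sizes = {
    "pickle": len(pickle.dumps(job_people)),
    # indent=4 puts every list item on its own indented line.
    "json (indent=4)": len(json.dumps(job_people, indent=4).encode("utf-8")),
    # Compact separators drop the spaces after ',' and ':'.
    "json (compact)": len(
        json.dumps(job_people, separators=(",", ":")).encode("utf-8")
    ),
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```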


2.3 Read the file `job_people.pickle` and demonstrate that the data was read correctly.

In [215]:
job_people_pickle = read_pickle(pickle_path)
print({k: v for k, v in list(job_people_pickle.items())[:3]})

{'energy engineer': ['uhebert', 'annmoore', 'garysilva', 'martinezashley', 'sextonsheila', 'pjames', 'smithjonathan', 'wardjames', 'cwheeler', 'ucarlson', 'robert71', 'johnsontheresa', 'amanda41', 'stacey47', 'timothynelson', 'timothynelson', 'rogersmichael', 'melissa94', 'wmcdaniel', 'charles74', 'smithjennifer', 'clintonjones'], 'engineer, site': ['uhebert', 'nancy12', 'andrea03', 'catherineross', 'wesley32', 'natalieross', 'rossdoris', 'christophersmith', 'dbooker', 'ericarobertson', 'trantricia', 'tpugh', 'jasonvelez', 'samantha36', 'brandidaniels', 'tenglish', 'reyesbrett', 'austin18', 'vjohnson', 'zmejia', 'daniel04', 'cynthia20', 'morgan15', 'avaldez', 'jessica92', 'laurieholloway', 'baileyvictoria'], 'environmental health practitioner': ['uhebert', 'jonathanchristian', 'xjohnson', 'dsmith', 'james01', 'nancytaylor', 'ztaylor', 'andrewwoods', 'susan54', 'fmaldonado', 'james74', 'bakerjacob', 'stephanie81', 'whitejoseph', 'qolson', 'hknox', 'gonzalesdaniel', 'tranronald', 'jesseg
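
Printing the first items is a visual check; an equality assertion is stricter. The sketch below does the round trip in memory on a small hypothetical dictionary and also shows that pickle restores the `defaultdict` type, whereas a JSON round trip yields a plain `dict`.

```python
import json
import pickle
from collections import defaultdict

# Small hypothetical stand-in for job_people.
job_people = defaultdict(list)
job_people["energy engineer"].extend(["uhebert", "annmoore"])
job_people["engineer, site"].append("uhebert")

# Round trip through pickle: contents and container type are preserved.
restored = pickle.loads(pickle.dumps(job_people))
assert restored == job_people
assert type(restored) is defaultdict

# Round trip through JSON: contents are equal, but the type is a plain dict.
from_json = json.loads(json.dumps(job_people))
assert from_json == job_people
assert type(from_json) is dict

print("round trip OK")
```

As the `pickle` documentation warns, unpickling can execute arbitrary code, so `pickle.load` should only be used on files from a trusted source.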

### XML

3.1 From the data in the file `steps_sample.xml`, build a dictionary of the steps for each recipe of the form `{recipe_id: ["step1", "step2"]}`. Save this dictionary to the file `steps_sample.json`

In [233]:
steps_sample_soup = read_xml('data/steps_sample.xml')

In [234]:
recipes_lst = steps_sample_soup.find_all('recipe')

In [235]:
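recipe_steps = {}
for recipe in recipes_lst:
    # Unlike the plain strings in the task statement, each step also keeps
    # its `has_minutes` attribute (None when the attribute is absent).
    recipe_steps[recipe.find('id').text] = [
        {'description': step.text,
         'has_minutes': step.attrs.get('has_minutes')}
        for step in recipe.find_all('step')
    ]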

In [236]:
{k: v for k, v in list(recipe_steps.items())[:3]}

{'44123': [{'description': 'in 1 / 4 cup butter , saute carrots , onion , celery and broccoli stems for 5 minutes',
   'has_minutes': '1'},
  {'description': 'add thyme , oregano and basil', 'has_minutes': None},
  {'description': 'saute 5 minutes more', 'has_minutes': '1'},
  {'description': 'add wine and deglaze pan', 'has_minutes': None},
  {'description': 'add hot chicken stock and reduce by one-third',
   'has_minutes': None},
  {'description': 'add worcestershire sauce , tabasco , smoked chicken , beans and broccoli florets',
   'has_minutes': None},
  {'description': 'simmer 5 minutes', 'has_minutes': '1'},
  {'description': 'add cream , simmer 5 minutes more and season to taste',
   'has_minutes': '1'},
  {'description': 'drop in remaining butter , piece by piece , stirring until melted and serve immediately',
   'has_minutes': None},
  {'description': 'smoked chicken: on a covered grill , slightly smoke boneless chicken , cooking to medium rare',
   'has_minutes': None},
  {'d

In [237]:
write_json('output/steps_sample.json', recipe_steps)

3.2 From the data in the file `steps_sample.xml`, build a dictionary of the following form: `number_of_steps_in_recipe: [list_of_recipe_ids]`

In [241]:
step_recipes = defaultdict(list)
for recipe in recipes_lst:
    step_recipes[len(recipe.find_all('step'))].append(recipe.find('id').text)

In [243]:
print({k: v for k, v in list(step_recipes.items())[:3]})

{11: {'171343', '221913', '431305', '190403', '410949', '239061', '404918', '14835', '36806', '172973', '201442', '39137', '63116', '225997', '90884', '64785', '22104', '64701', '118337', '410350', '95516', '251751', '71244', '59256', '356614', '197474', '165473', '503754', '48452', '279797', '108062', '394765', '120331', '22917', '193649', '334290', '120391', '186466', '332582', '461081', '141506', '88306', '243374', '183829', '336795', '277319', '113115', '286105', '107741', '68585', '302367', '34510', '286928', '36950', '136815', '104098', '45213', '228914', '316435', '86534', '409348', '90005', '236900', '19726', '447194', '172164', '384801', '288003', '102457', '418188', '245773', '71804', '109762', '2771', '159934', '246783', '465662', '203423', '236367', '477534', '84616', '438160', '443215', '503121', '83572', '176862', '175109', '427317', '181091', '104222', '171728', '104074', '253676', '240190', '326251', '110022', '399476', '40178', '136378', '34760', '209406', '229779', '4

3.3 Get the list of recipes whose steps contain information about time (hours or minutes). To select the matching recipes, pay attention to the attributes of the corresponding tags.
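
No solution is given for this task in the notebook. One possible approach is sketched below with the standard library's `xml.etree.ElementTree` on an inline fragment. The `has_minutes` attribute appears in the output of task 3.1; the `has_hours` attribute name, the root tag, and the recipes with ids `1` and `2` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Inline fragment; only recipe 44123 is based on the output shown earlier.
xml_text = """<recipes>
  <recipe><id>44123</id><steps>
    <step has_minutes="1">saute 5 minutes more</step>
    <step>add wine and deglaze pan</step>
  </steps></recipe>
  <recipe><id>1</id><steps>
    <step>mix everything</step>
  </steps></recipe>
  <recipe><id>2</id><steps>
    <step has_hours="1">chill for 2 hours</step>
  </steps></recipe>
</recipes>"""

# Attribute names that mark a step as mentioning time (has_hours is assumed).
TIME_ATTRS = ("has_minutes", "has_hours")

root = ET.fromstring(xml_text)

# Keep a recipe if at least one of its steps carries a time attribute.
timed_recipe_ids = [
    recipe.findtext("id")
    for recipe in root.iter("recipe")
    if any(
        step.get(attr) is not None
        for step in recipe.iter("step")
        for attr in TIME_ATTRS
    )
]

print(timed_recipe_ids)
```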

3.4 Load the data from the file `recipes_sample.csv` (__Lab 2__) into a table `recipes`. For rows with missing values in the `n_steps` column, fill this column based on the file `steps_sample.xml`. Leave rows where `n_steps` is already filled unchanged.
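
No solution is given for this task in the notebook. A possible sketch, assuming the step counts from the XML file are already available as a dictionary keyed by string recipe ids (as in task 3.1); the counts below are hypothetical.

```python
import pandas as pd

# Tiny stand-in for recipes: two rows lack n_steps, one already has it.
recipes = pd.DataFrame({
    "id": [44123, 67664, 84797],
    "n_steps": [None, None, 4.0],
})

# Hypothetical step counts derived from the XML; keys are strings there.
steps_from_xml = {"44123": 10, "67664": 3, "84797": 99}

# Convert keys to int so they match the integer 'id' column.
step_counts = {int(recipe_id): n for recipe_id, n in steps_from_xml.items()}

# fillna only touches missing cells, so the existing 4.0 is left unchanged.
recipes["n_steps"] = recipes["n_steps"].fillna(recipes["id"].map(step_counts))

print(recipes)
```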

3.5 Check whether the `n_steps` column contains missing values. If it does not, convert it to an integer type and save the result to the file `recipes_sample_with_filled_nsteps.csv`
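
No solution is given for this task in the notebook. A possible sketch on a small stand-in frame follows. To stay self-contained it writes the CSV to an in-memory buffer; in the lab, the file path `recipes_sample_with_filled_nsteps.csv` would be passed to `to_csv` instead.

```python
import io

import pandas as pd

# Stand-in for the table after n_steps has been filled.
recipes = pd.DataFrame({
    "id": [44123, 67664],
    "n_steps": [10.0, 3.0],
})

# Converting NaN to int would fail, so check for gaps first.
if recipes["n_steps"].isna().any():
    raise ValueError("n_steps still contains missing values")

recipes["n_steps"] = recipes["n_steps"].astype(int)

# index=False avoids writing the row index as an extra unnamed column.
buffer = io.StringIO()
recipes.to_csv(buffer, index=False)

print(buffer.getvalue())
```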