# Data formats (1)

Materials:
* Makrushin S.V. "Lecture 4: Data formats"
* https://docs.python.org/3/library/json.html
* https://docs.python.org/3/library/pickle.html
* https://www.crummy.com/software/BeautifulSoup/bs4/doc.ru/bs4ru.html
* Wes McKinney. Python for Data Analysis

## Problems to work through together

1. Print all e-mail addresses contained in the address book `addres-book.json`

In [10]:
import json
import pickle
from collections import defaultdict
from pathlib import Path
import pprint

import pandas as pd
from bs4 import BeautifulSoup as bs

In [11]:
def read_json(path: str):
    with open(path, encoding='utf-8') as f:
        return json.load(f)


def write_json(path: str, data: dict):
    with open(path, mode='w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

In [12]:
def read_xml(path: str) -> bs:
    # Note: 'lxml' is the HTML parser, so the parsed tree is wrapped in <html><body>.
    with open(path, encoding='utf-8') as f:
        return bs(f.read(), 'lxml')

In [13]:
def read_pickle(path: str):
    with open(path, mode='rb') as f:
        return pickle.load(f)


def write_pickle(path: str, data: dict):
    with open(path, mode='wb') as f:
        pickle.dump(data, f)

In [14]:
def pprint_xml(xml, n=10):
    print('\n'.join(xml.prettify().split('\n')[:n]) + '\n...')


def pprint_dict_of_list(dct: dict, n=3, m=10):
    pprint.pprint({k: [*v[:m], '...'] for k, v in list(dct.items())[:n]}, sort_dicts=False)


In [15]:
address_book = read_json('data/addressBook.json')
address_book

[{'name': 'Faina Lee',
  'email': 'faina@mail.ru',
  'birthday': '22.08.1994',
  'phones': [{'phone': '232-19-55'}, {'phone': '+7 (916) 232-19-55'}]},
 {'name': 'Robert Lee',
  'email': 'robert@mail.ru',
  'birthday': '22.08.1994',
  'phones': [{'phone': '111-19-55'}, {'phone': '+7 (916) 445-19-55'}]}]

In [16]:
emails = [address.get('email') for address in address_book]
emails

['faina@mail.ru', 'robert@mail.ru']

2. Print the phone numbers contained in the address book `addres-book.json`

In [17]:
phones = []
for address in address_book:
    for phone in address.get('phones', []):  # entries without phones are skipped
        phones.append(phone['phone'])

phones

['232-19-55', '+7 (916) 232-19-55', '111-19-55', '+7 (916) 445-19-55']
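
The two tasks above can also be tried without the data file. Below is a minimal sketch, assuming an inline JSON string shaped like the output of `In [15]`; the names `raw` and `book` are illustrative and do not appear in the notebook.

```python
import json

# Inline sample mirroring the structure of the address book shown above.
raw = '''
[
  {"name": "Faina Lee", "email": "faina@mail.ru",
   "phones": [{"phone": "232-19-55"}, {"phone": "+7 (916) 232-19-55"}]},
  {"name": "Robert Lee", "email": "robert@mail.ru",
   "phones": [{"phone": "111-19-55"}]}
]
'''

book = json.loads(raw)

# One e-mail per entry; entries without the key are skipped.
emails = [entry["email"] for entry in book if "email" in entry]

# Flatten the nested lists of phone dictionaries into a single list.
phones = [p["phone"] for entry in book for p in entry.get("phones", [])]

print(emails)
print(phones)
```

The nested comprehension does the same work as the explicit double loop in `In [17]`; which one reads better is a matter of taste.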

3. Using the data from the file `addres-book-q.xml`, build a list of dictionaries with each person's phone numbers.

In [18]:
soup = read_xml('data/address_book.xml')
pprint_xml(soup)

<?xml version="1.0" encoding="UTF-8" ?>
<html>
 <body>
  <address_book>
   <country name="algeria">
    <address id="1">
     <gender>
      m
     </gender>
     <name>
...


In [19]:
phone_book = []
for address in soup.find_all('address'):
    name = address.find('name').string
    phones = [phone.string for phone in address.find_all('phone')]
    phone_book.append({
        'name': name,
        'phones': phones
    })

phone_book

[{'name': 'Aicha Barki', 'phones': ['+ (213) 6150 4015', '+ (213) 2173 5247']},
 {'name': 'Francisco Domingos',
  'phones': ['+ (244-2) 325 023', '+ (244-2) 325 023']},
 {'name': 'Maria Luisa', 'phones': ['+ (244) 4232 2836']},
 {'name': 'Abraao Chanda',
  'phones': ['+ (244-2) 325 023', '+ (244-2) 325 023']},
 {'name': 'Beatriz Busaniche', 'phones': ['+ (54-11) 4784 1159']},
 {'name': 'Francesca Beddie',
  'phones': ['+ (61-2) 6274 9500', '+ (61-2) 6274 9513']},
 {'name': 'Graham John Smith', 'phones': ['+ (61-3) 9807 4702']}]
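
The same list of dictionaries can be built with the standard library alone. This sketch swaps BeautifulSoup for `xml.etree.ElementTree` and runs on a hypothetical inline fragment; only the `address`, `name`, and `phone` tag names are taken from the notebook, while the exact nesting is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real file may nest the elements differently.
xml_text = """
<address_book>
  <country name="algeria">
    <address id="1">
      <name>Aicha Barki</name>
      <phone>+ (213) 6150 4015</phone>
      <phone>+ (213) 2173 5247</phone>
    </address>
  </country>
  <country name="angola">
    <address id="2">
      <name>Maria Luisa</name>
      <phone>+ (244) 4232 2836</phone>
    </address>
  </country>
</address_book>
"""

root = ET.fromstring(xml_text.strip())

# iter() walks all descendants, so the nesting depth does not matter here.
phone_book = [
    {
        "name": address.findtext("name").strip(),
        "phones": [phone.text.strip() for phone in address.iter("phone")],
    }
    for address in root.iter("address")
]

print(phone_book)
```

Unlike the `lxml` HTML parser used in `read_xml`, `ElementTree` keeps the document root as is and does not add `<html>` or `<body>` wrappers.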

## Lab assignment 4

### JSON

1.1 Read the file `contributors_sample.json`. Using the `json` module, convert the file contents into the corresponding Python objects. Print information about the first 3 users.

In [20]:
contributors = read_json('data/contributorsSample.json')
contributors[:3]

[{'username': 'uhebert',
  'name': 'Lindsey Nguyen',
  'sex': 'F',
  'address': '01261 Cameron Spring\nTaylorfurt, AK 97791',
  'mail': 'jsalazar@gmail.com',
  'jobs': ['Energy engineer',
   'Engineer, site',
   'Environmental health practitioner',
   'Biomedical scientist',
   'Jewellery designer'],
  'id': 35193},
 {'username': 'vickitaylor',
  'name': 'Cheryl Lewis',
  'sex': 'F',
  'address': '66992 Welch Brooks\nMarshallshire, ID 56004',
  'mail': 'bhudson@gmail.com',
  'jobs': ['Music therapist',
   'Volunteer coordinator',
   'Designer, interior/spatial'],
  'id': 91970},
 {'username': 'sheilaadams',
  'name': 'Julia Allen',
  'sex': 'F',
  'address': 'Unit 1632 Box 2971\nDPO AE 23297',
  'mail': 'darren44@yahoo.com',
  'jobs': ['Management consultant',
   'Engineer, structural',
   'Lecturer, higher education',
   'Theatre manager',
   'Designer, textile'],
  'id': 1848091}]

1.2 Print the unique mail domains found in people's e-mail addresses.

In [21]:
mail_domains = set()
for contributor in contributors:
    mail_domains.add(contributor['mail'].split('@')[-1])

mail_domains

{'gmail.com', 'hotmail.com', 'yahoo.com'}
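
The loop above can be written as a set comprehension. The sketch below uses a small made-up sample instead of the real file and additionally lowercases the domain, which is an assumption about how duplicates such as `Yahoo.com` and `yahoo.com` should be treated.

```python
# Made-up sample records; only the 'mail' key matters here.
contributors = [
    {"username": "a", "mail": "a@gmail.com"},
    {"username": "b", "mail": "B@Yahoo.com"},
    {"username": "c", "mail": "c@gmail.com"},
]

# rpartition splits on the last '@', so the third element is the domain.
mail_domains = {c["mail"].rpartition("@")[2].lower() for c in contributors}

print(sorted(mail_domains))
```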

1.3 Write a function that looks up a person by `username` and prints information about them. If no user with the given `username` exists, raise a `ValueError`.

In [22]:
# assumption: username is unique
indexed_data = {contributor['username']: contributor for contributor in contributors}

In [23]:
def find_by_username(username: str) -> dict:
    contributor = indexed_data.get(username)
    if contributor is None:
        raise ValueError(f'Contributor with {username=} not found.')

    return contributor

In [24]:
find_by_username('uhebert')

{'username': 'uhebert',
 'name': 'Lindsey Nguyen',
 'sex': 'F',
 'address': '01261 Cameron Spring\nTaylorfurt, AK 97791',
 'mail': 'jsalazar@gmail.com',
 'jobs': ['Energy engineer',
  'Engineer, site',
  'Environmental health practitioner',
  'Biomedical scientist',
  'Jewellery designer'],
 'id': 35193}

In [25]:
try:
    find_by_username('aabbcc')
except ValueError as e:
    print(e)

Contributor with username='aabbcc' not found.


1.4 Count how many men and women are present in this dataset.

In [26]:
def count_by_sex(data):
    counter = defaultdict(int)  # missing keys start at 0
    for contributor in data:
        counter[contributor['sex'].upper()] += 1

    return {'M': counter['M'], 'F': counter['F']}

In [27]:
count_by_sex(contributors)

{'M': 2064, 'F': 2136}
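
An alternative to `defaultdict(int)` is `collections.Counter`, which counts items from any iterable. A minimal sketch on a made-up sample; `sample` and `result` are illustrative names.

```python
from collections import Counter

# Made-up records with mixed-case values to show the normalisation step.
sample = [{"sex": "F"}, {"sex": "m"}, {"sex": "F"}]

counts = Counter(record["sex"].upper() for record in sample)

# Counter returns 0 for keys it has never seen, so both keys are always present.
result = {"M": counts["M"], "F": counts["F"]}

print(result)
```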

1.5 Create a `pd.DataFrame` named `contributors` with the columns `id`, `username`, and `sex`.

In [28]:
contributors_df = pd.DataFrame([
    {
        'id': contributor['id'],
        'username': contributor['username'],
        'sex': contributor['sex']
    }
    for contributor in contributors
])

contributors_df

|  | id | username | sex |
|---|---|---|---|
| 0 | 35193 | uhebert | F |
| 1 | 91970 | vickitaylor | F |
| 2 | 1848091 | sheilaadams | F |
| 3 | 50969 | nicole82 | F |
| 4 | 676820 | jean67 | M |
| ... | ... | ... | ... |
| 4195 | 423555 | stevenspencer | F |
| 4196 | 35251 | rwilliams | M |
| 4197 | 135887 | lmartinez | F |
| 4198 | 212714 | brendahill | M |


1.6 Load the data from the file `recipes_sample.csv` (__Lab 2__) into a table `recipes`. Merge `recipes` with the `contributors` table, keeping rows for which information about the person is missing from the JSON file. For how many people is the information missing?

In [29]:
recipes_df = pd.read_csv('data/recipes_sample.csv', sep=',', parse_dates=['submitted'])
recipes_df

|  | name | id | minutes | contributor_id | submitted | n_steps | description | n_ingredients |
|---|---|---|---|---|---|---|---|---|
| 0 | george s at the cove black bean soup | 44123 | 90 | 35193 | 2002-10-25 | NaN | an original recipe created by chef scott meska... | 18.0 |
| 1 | healthy for them yogurt popsicles | 67664 | 10 | 91970 | 2003-07-26 | NaN | my children and their friends ask for my homem... | NaN |
| 2 | i can t believe it s spinach | 38798 | 30 | 1533 | 2002-08-29 | NaN | these were so go, it surprised even me. | 8.0 |
| 3 | italian gut busters | 35173 | 45 | 22724 | 2002-07-27 | NaN | my sister-in-law made these for us at a family... | NaN |
| 4 | love is in the air beef fondue sauces | 84797 | 25 | 4470 | 2004-02-23 | 4.0 | i think a fondue is a very romantic casual din... | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29995 | zurie s holey rustic olive and cheddar bread | 267661 | 80 | 200862 | 2007-11-25 | 16.0 | this is based on a french recipe but i changed... | 10.0 |
| 29996 | zwetschgenkuchen bavarian plum cake | 386977 | 240 | 177443 | 2009-08-24 | NaN | this is a traditional fresh plum cake, thought... | 11.0 |
| 29997 | zwiebelkuchen southwest german onion cake | 103312 | 75 | 161745 | 2004-11-03 | NaN | this is a traditional late summer early fall s... | NaN |
| 29998 | zydeco soup | 486161 | 60 | 227978 | 2012-08-29 | NaN | this is a delicious soup that i originally fou... | NaN |


In [30]:
recipes_contributors_df = pd.merge(
    recipes_df.drop_duplicates('contributor_id'),
    contributors_df,
    how='left',
    left_on='contributor_id',
    right_on='id'
)

recipes_contributors_df

|  | name | id_x | minutes | contributor_id | submitted | n_steps | description | n_ingredients | id_y | username | sex |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | george s at the cove black bean soup | 44123 | 90 | 35193 | 2002-10-25 | NaN | an original recipe created by chef scott meska... | 18.0 | 35193.0 | uhebert | F |
| 1 | healthy for them yogurt popsicles | 67664 | 10 | 91970 | 2003-07-26 | NaN | my children and their friends ask for my homem... | NaN | 91970.0 | vickitaylor | F |
| 2 | i can t believe it s spinach | 38798 | 30 | 1533 | 2002-08-29 | NaN | these were so go, it surprised even me. | 8.0 | NaN | NaN | NaN |
| 3 | italian gut busters | 35173 | 45 | 22724 | 2002-07-27 | NaN | my sister-in-law made these for us at a family... | NaN | NaN | NaN | NaN |
| 4 | love is in the air beef fondue sauces | 84797 | 25 | 4470 | 2004-02-23 | 4.0 | i think a fondue is a very romantic casual din... | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8399 | zucchini strips | 279769 | 30 | 522304 | 2008-01-18 | 8.0 | spicy, salty, crispy zucchini strips! | 4.0 | NaN | NaN | NaN |
| 8400 | zucchini with bacon corn peppers | 326105 | 45 | 896136 | 2008-09-19 | 8.0 | this is a very colorful addition to any meal. ... | 7.0 | NaN | NaN | NaN |
| 8401 | zucchini with bell pepper and tomato | 363362 | 13 | 344321 | 2009-03-29 | 19.0 | the weather has been turning warmer and i have... | 8.0 | 344321.0 | mistyray | F |
| 8402 | zucchini with serrano ham | 162411 | 15 | 152500 | 2006-03-31 | 6.0 | this dish is from tim malzer, a german chef wh... | 5.0 | NaN | NaN | NaN |


In [31]:
recipes_contributors_df['id_y'].isna().sum()

4204

In [32]:
pd.Series(list(
    set(recipes_df['contributor_id'].unique()) - set(contributors_df['id'].unique())
))

0        622593
1        557057
2       1105923
3         32772
4       1228804
         ...   
4199    2924524
4200     237551
4201     458738
4202     114681
4203      57338
Length: 4204, dtype: int64
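
Another way to find unmatched rows is the `indicator=True` option of `merge`, which adds a `_merge` column saying where each row came from. The sketch below uses two tiny made-up frames rather than the lab data; the column names follow the notebook, the values are invented.

```python
import pandas as pd

# Made-up frames: contributor 2 has two recipes, contributors 2 and 3 are unknown.
recipes = pd.DataFrame({
    "name": ["soup", "bread", "cake", "salad"],
    "contributor_id": [1, 2, 2, 3],
})
contributors = pd.DataFrame({"id": [1], "username": ["uhebert"]})

merged = recipes.merge(
    contributors,
    how="left",
    left_on="contributor_id",
    right_on="id",
    indicator=True,  # adds the '_merge' column
)

# 'left_only' marks recipes whose contributor is absent from the right table.
unmatched = merged.loc[merged["_merge"] == "left_only", "contributor_id"]

# nunique() counts people rather than recipes.
n_missing_people = unmatched.nunique()
print(n_missing_people)
```

Counting unique `contributor_id` values after the merge plays the same role as calling `drop_duplicates('contributor_id')` before it, as done in `In [30]`.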

### pickle

2.1 Based on the file `contributors_sample.json`, create a dictionary of the following form:
```
{
    job title: [list of usernames of people who held this job]
}
```

In [33]:
job_people = defaultdict(list)
for contributor in contributors:
    for job in set(contributor['jobs']):
        job_people[job.strip().casefold()].append(contributor['username'])

pprint_dict_of_list(job_people)

{'biomedical scientist': ['uhebert',
                          'smithheather',
                          'epittman',
                          'scotttyrone',
                          'limadeline',
                          'robertsmith',
                          'friedmanronald',
                          'sarahirwin',
                          'xmiller',
                          'jeremy66',
                          '...'],
 'environmental health practitioner': ['uhebert',
                                       'jonathanchristian',
                                       'xjohnson',
                                       'dsmith',
                                       'james01',
                                       'nancytaylor',
                                       'ztaylor',
                                       'andrewwoods',
                                       'susan54',
                                       'fmaldonado',
                                       '...'],


2.2 Save the results to the files `job_people.pickle` and `job_people.json`, using the pickle and JSON formats respectively. Compare the sizes of the resulting files. When saving to JSON, specify the `indent` argument.

In [34]:
pickle_path = 'data/output/job_people.pickle'
write_pickle(pickle_path, job_people)

In [35]:
json_path = 'data/output/job_people.json'
write_json(json_path, job_people)

In [36]:
print(f'pickle: {Path(pickle_path).stat().st_size} bytes')
print(f'json: {Path(json_path).stat().st_size} bytes')

pickle: 132168 bytes
json: 406637 bytes
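
Part of the size gap probably comes from the whitespace that `indent` adds. The sketch below compares the serialised sizes in memory on a small made-up dictionary, so the byte counts will not match the figures above.

```python
import json
import pickle

# Made-up data with the same shape as job_people.
job_people = {
    "biomedical scientist": ["uhebert", "smithheather"],
    "music therapist": ["vickitaylor"],
}

pickled = pickle.dumps(job_people)
json_indented = json.dumps(job_people, ensure_ascii=False, indent=4).encode("utf-8")
# separators=(",", ":") drops the optional spaces after commas and colons.
json_compact = json.dumps(
    job_people, ensure_ascii=False, separators=(",", ":")
).encode("utf-8")

for label, blob in [
    ("pickle", pickled),
    ("json (indent=4)", json_indented),
    ("json (compact)", json_compact),
]:
    print(f"{label}: {len(blob)} bytes")
```

For a fair comparison on the real data it may be worth measuring the compact JSON form as well, since `indent` is a readability option rather than part of the format.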


2.3 Read the file `job_people.pickle` and demonstrate that the data was read correctly.

In [37]:
job_people_pickle = read_pickle(pickle_path)
pprint_dict_of_list(job_people_pickle)

{'biomedical scientist': ['uhebert',
                          'smithheather',
                          'epittman',
                          'scotttyrone',
                          'limadeline',
                          'robertsmith',
                          'friedmanronald',
                          'sarahirwin',
                          'xmiller',
                          'jeremy66',
                          '...'],
 'environmental health practitioner': ['uhebert',
                                       'jonathanchristian',
                                       'xjohnson',
                                       'dsmith',
                                       'james01',
                                       'nancytaylor',
                                       'ztaylor',
                                       'andrewwoods',
                                       'susan54',
                                       'fmaldonado',
                                       '...'],
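
Comparing printed output by eye is one way to check the round trip; an equality test is stricter. A minimal in-memory sketch, with `original` and `restored` as illustrative names:

```python
import pickle
from collections import defaultdict

original = defaultdict(list)
original["biomedical scientist"].append("uhebert")

# Serialise to bytes and back without touching the file system.
restored = pickle.loads(pickle.dumps(original))

print(restored == original)
# pickle keeps the container type, including the default factory.
print(type(restored).__name__, restored.default_factory is list)
```

In the notebook the same check would be `job_people_pickle == job_people`.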


### XML

3.1 Using the data in `steps_sample.xml`, build a dictionary with the steps of each recipe, of the form `{recipe_id: ["step1", "step2"]}`. Save this dictionary to the file `steps_sample.json`.

In [38]:
steps_sample_soup = read_xml('data/steps_sample.xml')

In [39]:
recipes_lst = steps_sample_soup.find_all('recipe')

In [40]:
recipe_steps = {}
for recipe in recipes_lst:
    recipe_steps[recipe.find('id').text] = [
        {
            'description': step.text,
            **step.attrs
        }
        for step in recipe.find_all('step')
    ]

In [41]:
pprint_dict_of_list(recipe_steps)

{'44123': [{'description': 'in 1 / 4 cup butter , saute carrots , onion , '
                           'celery and broccoli stems for 5 minutes',
            'has_minutes': '1'},
           {'description': 'add thyme , oregano and basil'},
           {'description': 'saute 5 minutes more', 'has_minutes': '1'},
           {'description': 'add wine and deglaze pan'},
           {'description': 'add hot chicken stock and reduce by one-third'},
           {'description': 'add worcestershire sauce , tabasco , smoked '
                           'chicken , beans and broccoli florets'},
           {'description': 'simmer 5 minutes', 'has_minutes': '1'},
           {'description': 'add cream , simmer 5 minutes more and season to '
                           'taste',
            'has_minutes': '1'},
           {'description': 'drop in remaining butter , piece by piece , '
                           'stirring until melted and serve immediately'},
           {'description': 'smoked chicken: on a 

In [42]:
write_json('data/output/steps_sample.json', recipe_steps)

3.2 Using the data in `steps_sample.xml`, build a dictionary of the following form: `number_of_steps_in_recipe: [list_of_recipe_ids]`

In [43]:
step_recipes = defaultdict(list)
for recipe in recipes_lst:
    step_recipes[len(recipe.find_all('step'))].append(recipe.find('id').text)

In [44]:
pprint_dict_of_list(step_recipes)

{11: ['44123',
      '302399',
      '375376',
      '140610',
      '374703',
      '111198',
      '257111',
      '432661',
      '114204',
      '63069',
      '...'],
 3: ['67664',
     '118843',
     '147477',
     '367987',
     '216068',
     '375362',
     '367828',
     '286484',
     '27060',
     '121712',
     '...'],
 5: ['38798',
     '69190',
     '125195',
     '306590',
     '77380',
     '378067',
     '218671',
     '453243',
     '101260',
     '416599',
     '...']}
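
The keys of this dictionary are integers. If such a dictionary were saved to JSON, the keys would come back as strings, because JSON object keys are always strings. A short sketch with made-up values:

```python
import json

step_recipes = {11: ["44123", "302399"], 3: ["67664"]}

# Round trip through a JSON string.
restored = json.loads(json.dumps(step_recipes))
print(list(restored))

# Converting the keys back restores the original mapping.
restored_int = {int(key): value for key, value in restored.items()}
print(restored_int == step_recipes)
```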


3.3 Get the list of recipes whose steps contain timing information (hours or minutes). To select the matching recipes, pay attention to the attributes of the corresponding tags.

In [45]:
recipes_with_time_info = []
for recipe in recipes_lst:
    if recipe.find_all(
            lambda x: x.name == 'step' and (x.attrs.get('has_minutes') or x.attrs.get('has_hours'))
    ):
        recipes_with_time_info.append(recipe.find('id').text)

In [46]:
print(len(recipes_with_time_info))
recipes_with_time_info[:10]

23469


['44123',
 '35173',
 '453467',
 '306168',
 '50662',
 '118843',
 '149593',
 '200148',
 '310570',
 '95534']
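
The attribute filter can be sketched with the standard library as well. This version uses `xml.etree.ElementTree` instead of BeautifulSoup on a hypothetical fragment; the `has_minutes` and `has_hours` attribute names come from the notebook, while the nesting of `recipe`, `id`, and `step` is an assumption. It tests for the presence of an attribute, whereas the lambda in `In [45]` tests that its value is non-empty.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real steps_sample.xml may be laid out differently.
xml_text = """
<recipes>
  <recipe><id>1</id><steps>
    <step has_minutes="1">simmer 5 minutes</step>
    <step>serve</step>
  </steps></recipe>
  <recipe><id>2</id><steps>
    <step>mix</step>
  </steps></recipe>
  <recipe><id>3</id><steps>
    <step has_hours="1">chill 2 hours</step>
  </steps></recipe>
</recipes>
"""

root = ET.fromstring(xml_text.strip())


def has_time_info(recipe):
    """Return True if any step carries a has_minutes or has_hours attribute."""
    return any(
        "has_minutes" in step.attrib or "has_hours" in step.attrib
        for step in recipe.iter("step")
    )


ids_with_time = [
    recipe.findtext("id") for recipe in root.iter("recipe") if has_time_info(recipe)
]
print(ids_with_time)
```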

3.4 Load the data from `recipes_sample.csv` (__Lab 2__) into a table `recipes`. For rows with missing values in the `n_steps` column, fill the column from the file `steps_sample.xml`. Leave rows where `n_steps` is already filled unchanged.

In [47]:
steps_df = pd.DataFrame([
    {
        'id': int(recipe.find('id').text),
        'n_steps': len(recipe.find_all('step')),
    }
    for recipe in recipes_lst
])

steps_df

|  | id | n_steps |
|---|---|---|
| 0 | 44123 | 11 |
| 1 | 67664 | 3 |
| 2 | 38798 | 5 |
| 3 | 35173 | 7 |
| 4 | 84797 | 4 |
| ... | ... | ... |
| 29995 | 267661 | 16 |
| 29996 | 386977 | 22 |
| 29997 | 103312 | 10 |
| 29998 | 486161 | 7 |


In [48]:
recipes_steps_df = pd.merge(
    recipes_df,
    steps_df,
    how='left',
    on='id'
)

recipes_steps_df

|  | name | id | minutes | contributor_id | submitted | n_steps_x | description | n_ingredients | n_steps_y |
|---|---|---|---|---|---|---|---|---|---|
| 0 | george s at the cove black bean soup | 44123 | 90 | 35193 | 2002-10-25 | NaN | an original recipe created by chef scott meska... | 18.0 | 11 |
| 1 | healthy for them yogurt popsicles | 67664 | 10 | 91970 | 2003-07-26 | NaN | my children and their friends ask for my homem... | NaN | 3 |
| 2 | i can t believe it s spinach | 38798 | 30 | 1533 | 2002-08-29 | NaN | these were so go, it surprised even me. | 8.0 | 5 |
| 3 | italian gut busters | 35173 | 45 | 22724 | 2002-07-27 | NaN | my sister-in-law made these for us at a family... | NaN | 7 |
| 4 | love is in the air beef fondue sauces | 84797 | 25 | 4470 | 2004-02-23 | 4.0 | i think a fondue is a very romantic casual din... | NaN | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29995 | zurie s holey rustic olive and cheddar bread | 267661 | 80 | 200862 | 2007-11-25 | 16.0 | this is based on a french recipe but i changed... | 10.0 | 16 |
| 29996 | zwetschgenkuchen bavarian plum cake | 386977 | 240 | 177443 | 2009-08-24 | NaN | this is a traditional fresh plum cake, thought... | 11.0 | 22 |
| 29997 | zwiebelkuchen southwest german onion cake | 103312 | 75 | 161745 | 2004-11-03 | NaN | this is a traditional late summer early fall s... | NaN | 10 |
| 29998 | zydeco soup | 486161 | 60 | 227978 | 2012-08-29 | NaN | this is a delicious soup that i originally fou... | NaN | 7 |


In [49]:
mask = recipes_steps_df['n_steps_x'].isna()
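# Relies on the merged frame sharing the row index of recipes_df (assumes `id` is unique in steps_df).
recipes_df.loc[mask, 'n_steps'] = recipes_steps_df.loc[mask, 'n_steps_y']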

In [50]:
recipes_df

|  | name | id | minutes | contributor_id | submitted | n_steps | description | n_ingredients |
|---|---|---|---|---|---|---|---|---|
| 0 | george s at the cove black bean soup | 44123 | 90 | 35193 | 2002-10-25 | 11.0 | an original recipe created by chef scott meska... | 18.0 |
| 1 | healthy for them yogurt popsicles | 67664 | 10 | 91970 | 2003-07-26 | 3.0 | my children and their friends ask for my homem... | NaN |
| 2 | i can t believe it s spinach | 38798 | 30 | 1533 | 2002-08-29 | 5.0 | these were so go, it surprised even me. | 8.0 |
| 3 | italian gut busters | 35173 | 45 | 22724 | 2002-07-27 | 7.0 | my sister-in-law made these for us at a family... | NaN |
| 4 | love is in the air beef fondue sauces | 84797 | 25 | 4470 | 2004-02-23 | 4.0 | i think a fondue is a very romantic casual din... | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29995 | zurie s holey rustic olive and cheddar bread | 267661 | 80 | 200862 | 2007-11-25 | 16.0 | this is based on a french recipe but i changed... | 10.0 |
| 29996 | zwetschgenkuchen bavarian plum cake | 386977 | 240 | 177443 | 2009-08-24 | 22.0 | this is a traditional fresh plum cake, thought... | 11.0 |
| 29997 | zwiebelkuchen southwest german onion cake | 103312 | 75 | 161745 | 2004-11-03 | 10.0 | this is a traditional late summer early fall s... | NaN |
| 29998 | zydeco soup | 486161 | 60 | 227978 | 2012-08-29 | 7.0 | this is a delicious soup that i originally fou... | NaN |
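
The fill step can also be done without a merge, by looking up the step count for each `id` and passing the result to `fillna`. A sketch on made-up frames; `steps_by_id` is an illustrative name, and the value 99 is deliberately wrong to show that rows already filled are left alone.

```python
import pandas as pd

# Made-up data: two recipes lack n_steps, one already has it.
recipes = pd.DataFrame({
    "id": [44123, 67664, 84797],
    "n_steps": [float("nan"), float("nan"), 4.0],
})

# Lookup table: recipe id -> number of steps counted in the XML file.
steps_by_id = pd.Series({44123: 11, 67664: 3, 84797: 99})

# map() translates each id into a step count; fillna() uses it only where n_steps is NaN.
recipes["n_steps"] = recipes["n_steps"].fillna(recipes["id"].map(steps_by_id))

print(recipes["n_steps"].tolist())
```

Because the lookup is keyed on `id`, this variant does not depend on the two frames having the same row order.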


3.5 Check whether the `n_steps` column contains missing values. If not, convert it to an integer type and save the results to the file `recipes_sample_with_filled_nsteps.csv`.

In [51]:
if recipes_df['n_steps'].isna().sum() == 0:
    recipes_df['n_steps'] = recipes_df['n_steps'].astype(int)
    recipes_df.to_csv('data/output/recipes_sample_with_filled_nsteps.csv', sep=',')
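
By default `to_csv` also writes the row index as an unnamed first column. Whether that is wanted here is not stated in the task; the sketch below shows the `index=False` variant on a made-up frame and writes to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Made-up frame standing in for recipes_df.
df = pd.DataFrame({"id": [44123, 67664], "n_steps": [11.0, 3.0]})

# Convert only when there is nothing missing, as in the task statement.
if df["n_steps"].notna().all():
    df["n_steps"] = df["n_steps"].astype(int)

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # no extra index column in the output

buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```

With `index=False` the file reads back with exactly the original columns, so no `Unnamed: 0` column appears on the next load.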
