# Mini assignment 1

In this assignment you need to deploy infrastructure in the Azure cloud, preprocess the data, and upload the result to cloud storage.

We will work with the dataset from a Kaggle competition: https://www.kaggle.com/c/avito-context-ad-clicks . You need a Kaggle account to access it; create one if you do not have it yet.

The file `VisitsStream.tsv` contains information about users' visits to the site in chronological order. Take the first 1,000,000 records from this table and compute the top 10 users (userId) by number of visits. The result must also show the number of visits for each of these users. All stages (downloading, transforming, and uploading the data) must be done in Bash, not in Python scripts.

To download the file, use the `kaggle CLI`: https://github.com/Kaggle/kaggle-api.

Upload the resulting file to cloud storage so that it is publicly accessible from outside. A link to this file must also be included in the submitted solution.

Example of a result file:
```
   5213 45743577
   2617 764540
    794 4353457
    323 3093437
    124 4293412
     78 773445
     56 3665544
     43 323413
     36 2632454
     32 1124347
```
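
The `<count> <userId>` layout above is what a `sort | uniq -c` pipeline produces. The following is a minimal sketch on made-up toy data, not the actual solution; the function name `top_users` is hypothetical, and the exact width of the padding before the count depends on the `uniq` implementation.

```shell
#!/usr/bin/env bash
# Toy illustration of the counting pipeline; the IDs below are made up.

# top_users N: read one user ID per line on stdin and print the N most
# frequent IDs as "<count> <id>" lines, most frequent first.
top_users() {
  # sort groups equal IDs, uniq -c counts each group,
  # the second sort orders by count (descending), then by ID (ascending).
  sort | uniq -c | sort -k1,1nr -k2,2n | head -n "$1"
}

# ID 7 occurs three times, ID 3 twice, ID 9 once.
printf '%s\n' 7 3 7 9 3 7 | top_users 2
```

For the real task the input would be the userId column of the first 1,000,000 records and `N` would be 10.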

As your completed homework, submit a Jupyter Notebook in `.ipynb` format (see the Wiki for the submission link). Put your first and last name in the file name. The notebook must contain:

1. The Terraform configuration that creates a virtual machine and a cloud storage account with one container.
2. The Bash script that you ran on the machine to process the data and upload it to the cloud.
3. A link to the uploaded file in your cloud.

**Important!** Do not delete the cloud storage until the homework has been graded! The grader will download the file, and if it is not accessible, this will affect your grade.

**Important!** Do not expose your Kaggle password (or any other passwords). We are of course not going to do anything with them, but it is useful to practice your information security skills.
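
One way to keep the Kaggle token out of the notebook is to pass it through environment variables and write `kaggle.json` with owner-only permissions. The following is a minimal sketch, assuming `KAGGLE_USERNAME` and `KAGGLE_KEY` are already set in the shell (the Kaggle CLI can also read these two variables directly); the helper name `write_kaggle_config` is hypothetical.

```shell
#!/usr/bin/env bash
# write_kaggle_config DIR: create DIR/kaggle.json from the KAGGLE_USERNAME and
# KAGGLE_KEY environment variables, readable and writable by the owner only.
write_kaggle_config() {
  local dir="$1"
  # Abort with a message if either variable is missing or empty.
  : "${KAGGLE_USERNAME:?set KAGGLE_USERNAME first}"
  : "${KAGGLE_KEY:?set KAGGLE_KEY first}"
  (
    umask 077                       # new files and directories: owner access only
    mkdir -p "$dir"
    printf '{"username":"%s","key":"%s"}\n' "$KAGGLE_USERNAME" "$KAGGLE_KEY" \
      > "$dir/kaggle.json"
    chmod 600 "$dir/kaggle.json"    # also tighten a file that already existed
  )
}
```

On the server this could be called as `write_kaggle_config "$HOME/.kaggle"`, so that the secret itself never appears in a notebook cell or its output.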

*Note.* Keep in mind that DNS names and storage account names are unique across all of Azure. So simply copying the example from the seminar will not work: you need at least to adjust these names.
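
Besides being globally unique, a storage account name has to be 3 to 24 characters long and consist of lowercase letters and digits only. The sketch below checks just this syntactic rule locally; the function name `is_valid_storage_account_name` is hypothetical, and whether a name is actually free can only be answered by Azure itself (for example with `az storage account check-name`).

```shell
#!/usr/bin/env bash
# is_valid_storage_account_name NAME: succeed if NAME is 3-24 characters long
# and contains only lowercase ASCII letters and digits.
is_valid_storage_account_name() {
  # LC_ALL=C keeps the [a-z] range limited to ASCII lowercase letters.
  printf '%s' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]{3,24}$'
}

for name in arinkalsmlstorage Lsml-Storage ab; do
  if is_valid_storage_account_name "$name"; then
    echo "$name: syntactically valid"
  else
    echo "$name: invalid"
  fi
done
```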

## Terraform

In [2]:
! az login -u arina.ruck@gmail.com -p "$(cat password-file.txt)"

[
  {
    "cloudName": "AzureCloud",
    "homeTenantId": "fa34e92c-e581-439a-b6ae-b41b61270513",
    "id": "dfeaef04-2e5c-4480-8ebf-301863ed526a",
    "isDefault": true,
    "managedByTenants": [],
    "name": "Microsoft Azure Sponsorship 2",
    "state": "Disabled",
    "tenantId": "fa34e92c-e581-439a-b6ae-b41b61270513",
    "user": {
      "name": "arina.ruck@gmail.com",
      "type": "user"
    }
  },
  {
    "cloudName": "AzureCloud",
    "homeTenantId": "fa34e92c-e581-439a-b6ae-b41b61270513",
    "id": "3750c13c-7871-4812-9fd5-94fb4ce86681",
    "isDefault": false,
    "managedByTenants": [],
    "name": "Lab1 21 Rak Arina Sergeevna",
    "state": "Enabled",
    "tenantId": "fa34e92c-e581-439a-b6ae-b41b61270513",
    "user": {
      "name": "arina.ruck@gmail.com",
      "type": "user"
    }
  }
]

In [3]:
! mkdir -p my-first-cloud

%cd my-first-cloud/

/Users/arinaruck/Desktop/cs/lsml/lsml-2021-public/my-first-cloud


In [4]:
! terraform init


Initializing the backend...

Initializing provider plugins...
- Reusing previous version of hashicorp/azurerm from the dependency lock file
- Installing hashicorp/azurerm v2.40.0...
- Installed hashicorp/azurerm v2.40.0 (signed by HashiCorp)

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.


In [5]:
with open('../password-file.txt', 'r') as f:
    my_password = f.read()

In [6]:
%%writefile install-conda.sh

# get link from this site - https://www.anaconda.com/distribution/

wget https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh -O install-conda.sh
chmod +x install-conda.sh
./install-conda.sh -b -p $HOME/anaconda3

# add conda to PATH

eval "$(/home/azureuser/anaconda3/bin/conda shell.bash hook)"
conda init

Overwriting install-conda.sh


In [7]:
%%writefile common.tf

# Shared resources are described here

provider "azurerm" {
  # The provider we will work with; in this case, Azure
  subscription_id = "3750c13c-7871-4812-9fd5-94fb4ce86681"
  features {}
}

# Resource group. All other resources will be linked to it
resource "azurerm_resource_group" "lsml_rg" {
  name = "lsml-resource-group"
  location = "westus"
}

# Virtual network inside the cloud
resource "azurerm_virtual_network" "lsml_vn" {
  resource_group_name = azurerm_resource_group.lsml_rg.name
  location = azurerm_resource_group.lsml_rg.location

  name = "lsml-vitrual-network"

  address_space = ["10.0.0.0/16"]
  # Address pool inside the network
}

# Virtual subnet inside the cloud
resource "azurerm_subnet" "lsml_subnet" {
  resource_group_name = azurerm_resource_group.lsml_rg.name
  virtual_network_name = azurerm_virtual_network.lsml_vn.name

  name = "internal"

  address_prefixes = ["10.0.2.0/24"]
}

Overwriting common.tf


In [8]:
%%writefile vm.tf
# Resources required for the machine are described here

# Public address of the machine
resource "azurerm_public_ip" "lsml_pub_ip" {
  location = azurerm_resource_group.lsml_rg.location
  resource_group_name = azurerm_resource_group.lsml_rg.name

  name = "lsml-public-ip"

  allocation_method = "Dynamic"
  # Assign a dynamic IP
  idle_timeout_in_minutes = 30
  domain_name_label = "arinkalsmlmachine"
  # A DNS name can be used for convenience
}

# Network interface for our machine
resource "azurerm_network_interface" "lsml_ni" {
  location = azurerm_resource_group.lsml_rg.location
  resource_group_name = azurerm_resource_group.lsml_rg.name

  name = "lsml-nic"

  ip_configuration {
    name = "internal"
    private_ip_address_allocation = "Dynamic"
    subnet_id = azurerm_subnet.lsml_subnet.id
    public_ip_address_id = azurerm_public_ip.lsml_pub_ip.id
  }
}

# The virtual machine itself
resource "azurerm_virtual_machine" "lsml_vm" {
  resource_group_name = azurerm_resource_group.lsml_rg.name
  location = azurerm_resource_group.lsml_rg.location

  name = "lsml-machine"

  vm_size = "Standard_F2"
  # 2 CPU 4 GB
  # Available sizes: https://azure.microsoft.com/en-us/pricing/details/virtual-machines/series/

  network_interface_ids = [
    azurerm_network_interface.lsml_ni.id,
    # Attach to the network
  ]

  storage_image_reference {
    # Use the Ubuntu 16.04 image
    publisher = "Canonical"
    offer = "UbuntuServer"
    sku = "16.04-LTS"
    version = "latest"
  }

  os_profile {
    computer_name = "hostname"
    admin_username = "azureuser"
    # The user and their password
    admin_password = "Password1234!"
  }
  os_profile_linux_config {
    disable_password_authentication = false
    ssh_keys {
      path = "/home/azureuser/.ssh/authorized_keys"
      key_data = file("~/.ssh/id_rsa.pub")
      # We can provide our public key for authentication on the machine
    }
  }

  delete_os_disk_on_termination = true
  storage_os_disk {
    name = "main-disk"
    caching = "ReadWrite"
    create_option = "FromImage"
    managed_disk_type = "Standard_LRS"
    disk_size_gb = "200"
    # Specify the type of the main disk and its size
  }

  provisioner "remote-exec" { # Run remotely
    script = "install-conda.sh" # Which script to run

    connection {
      # How to connect to the machine: ssh with a password and the domain name
      type = "ssh"
      user = "azureuser"
      password = "Password1234!"
      host = "${azurerm_public_ip.lsml_pub_ip.domain_name_label}.${azurerm_resource_group.lsml_rg.location}.cloudapp.azure.com"
    }
  }
}

# Save a few important values as outputs. They can be used later for our own purposes

data "azurerm_public_ip" "lsml_public_ip" {
  name = azurerm_public_ip.lsml_pub_ip.name
  resource_group_name = azurerm_virtual_machine.lsml_vm.resource_group_name
}

output "public_domain" {
  # Domain of our server
  value = "${azurerm_public_ip.lsml_pub_ip.domain_name_label}.${azurerm_resource_group.lsml_rg.location}.cloudapp.azure.com"
}

output "public_ip" {
  # Public IP of our server
  value = data.azurerm_public_ip.lsml_public_ip.ip_address
}

output "private_ip" {
  # Private IP of our server
  value = azurerm_network_interface.lsml_ni.private_ip_address
}

Overwriting vm.tf


In [9]:
%%writefile blob.tf
# Storage account
resource "azurerm_storage_account" "lsml_sa" {
  name = "arinkalsmlstorage"
  resource_group_name = azurerm_resource_group.lsml_rg.name
  location = azurerm_resource_group.lsml_rg.location
  account_tier = "Standard"
  account_replication_type = "LRS"
  allow_blob_public_access = true
  # Access is private by default. Allow public access.
}

# Container inside WASB
resource "azurerm_storage_container" "lsml_sc" {
  name = "content"
  # The name of our data container
  storage_account_name = azurerm_storage_account.lsml_sa.name
  container_access_type = "blob"
}


# By default the main user has limited permissions for working with WASB.
# Grant access at the "Storage Blob Data Contributor" level
data "azurerm_subscription" "primary" {
}

data "azurerm_client_config" "example" {
}

resource "azurerm_role_assignment" "example" {
  scope = data.azurerm_subscription.primary.id
  principal_id = data.azurerm_client_config.example.object_id

  role_definition_name = "Storage Blob Data Contributor"
}

Overwriting blob.tf


In [10]:
! echo "yes" | terraform apply


An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create
 <= read (data resources)

Terraform will perform the following actions:

  # data.azurerm_public_ip.lsml_public_ip will be read during apply
  # (config refers to values not yet known)
 <= data "azurerm_public_ip" "lsml_public_ip"  {
      + allocation_method       = (known after apply)
      + domain_name_label       = (known after apply)
      + fqdn                    = (known after apply)
      + id                      = (known after apply)
      + idle_timeout_in_minutes = (known after apply)
      + ip_address              = (known after apply)
      + ip_version              = (kno

      + secondary_web_host               = (known after apply)

      + blob_properties {
          + cors_rule {
              + allowed_headers    = (known after apply)
              + allowed_methods    = (known after apply)
              + allowed_origins    = (known after apply)
              + exposed_headers    = (known after apply)
              + max_age_in_seconds = (known after apply)
            }

          + delete_retention_policy {
              + days = (known after apply)
            }
        }

      + identity {
          + principal_id = (known after apply)
          + tenant_id    = (known after apply)
          + ty

azurerm_resource_group.lsml_rg: Creating...
azurerm_role_assignment.example: Creating...
azurerm_resource_group.lsml_rg: Creation complete after 4s [id=/subscriptions/3750c13c-7871-4812-9fd5-94fb4ce86681/resourceGroups/lsml-resource-group]
azurerm_public_ip.lsml_pub_ip: Creating...
azurerm_virtual_network.lsml_vn: Creating...
azurerm_storage_account.lsml_sa: Creating...
azurerm_role_assignment.example: Still creating... [10s elapsed]
azurerm_public_ip.lsml_pub_ip: Creation complete after 10s [id=/subscriptions/3750c13c-7871-4812-9fd5-94fb4ce86681/resourceGroups/lsml-resource-group/providers/Microsoft.Network/publicIPAddresses/lsml-public-ip]
azurerm_virtual_network.lsml_vn: Still creating... [10s elapsed]
azurerm_storage_account.lsml_sa: Still creating... [10s elapsed]
azurerm_virtual_network.lsml_vn: Creation complete a

azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : astroid-2.4.2-py38_0.conda:
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : astroid-2.4.2-py38_0.conda:
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : click-7.1.2-py_0.conda:   4
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : simplegeneric-0.8.1-py38_2.
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : boto-2.49.0-py38_0.conda:
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : boto-2.49.0-py38_0.conda:
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : harfbuzz-2.4.0-hca77d97_1.c
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : backports.weakref-1.0.post1
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : backports.weakref-1.0.post1
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : et_xmlfile-1.0.1-py_1001.co
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : pyzmq-19.0.2-py38he6710b0_1
azurerm_virt

azurerm_virtual_machine.lsml_vm: Still creating... [3m10s elapsed]
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : ipython-7.19.0-py38hb070fc8
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : xmltodict-0.12.0-py_0.conda
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : fsspec-0.8.3-py_0.conda:  2
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : pyrsistent-0.17.3-py38h7b64
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : openssl-1.1.1h-h7b6447c_0.c
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : openssl-1.1.1h-h7b6447c_0.c
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : bottleneck-1.3.2-py38heb32a
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : diff-match-patch-20200713-p
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : packaging-20.4-py_0.conda:
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : mkl_random-1.1.1-py38h0573a
azurerm_virtual_machin

azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : mpmath-1.1.0-py38_0.conda:
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : wrapt-1.11.2-py38h7b6447c_0
azurerm_virtual_machine.lsml_vm: Still creating... [3m30s elapsed]
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : scikit-image-0.17.2-py38hdf
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : wcwidth-0.2.5-py_0.conda:
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : wcwidth-0.2.5-py_0.conda:
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : intel-openmp-2020.2-254.con
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : libspatialindex-1.9.3-he671
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : pixman-0.40.0-h7b6447c_0.co
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : pixman-0.40.0-h7b6447c_0.co
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : joblib-0.17.0-py_0.conda:
azurerm_virtual_mach

azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : cython-0.29.21-py38he6710b0
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : cython-0.29.21-py38he6710b0
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : sphinxcontrib-jsmath-1.0.1-
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : ipykernel-5.3.4-py38h5ca1d4
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : nbclient-0.5.1-py_0.conda:
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : sympy-1.6.2-py38h06a4308_1.
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : sympy-1.6.2-py38h06a4308_1.
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : threadpoolctl-2.1.0-pyh5ca1
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : sortedcontainers-2.2.2-py_0
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : contextlib2-0.6.0.post1-py_
azurerm_virtual_machine.lsml_vm (remote-exec): Extracting : gstreamer-1.14.0-hb31296c_0
azurerm_v

azurerm_virtual_machine.lsml_vm (remote-exec)done

azurerm_virtual_machine.lsml_vm (remote-exec): ## Package Plan ##

azurerm_virtual_machine.lsml_vm (remote-exec):   environment location: /home/azureuser/anaconda3

azurerm_virtual_machine.lsml_vm (remote-exec):   added / updated specs:
azurerm_virtual_machine.lsml_vm (remote-exe

azurerm_virtual_machine.lsml_vm (remote-exec):     - kiwisolver==1.3.0=py38h2531618_0
azurerm_virtual_machine.lsml_vm (remote-exec):     - krb5==1.18.2=h173b8e3_0
azurerm_virtual_machine.lsml_vm (remote-exec):     - lazy-object-proxy==1.4.3=py38h7b6447c_0
azurerm_virtual_machine.lsml_vm (remote-exec):     - lcms2==2.11=h396b838_0
azurerm_virtual_machine.lsml_vm (remote-exec):     - ld_impl_linux-64==2.33.1=h53a641e_7
azurerm_virtual_machine.lsml_vm (remote-exec):     - libarchive==3.4.2=h62408e4_0
azurerm_virtual_machine.lsml_vm (remote-exec):     - libcurl==7.71.1=h20c2e04_1
azurerm_virtual_machine.lsml_vm (remote-exec):     - libedit==3.1.20191231=h14c3975_1
azurerm_virtual_machine.lsml_vm (remote-exec):     - libffi==3.3=he6710b0_2
azurerm_virtual_machine.lsml_vm (remote-exec):     - libgcc-ng==9.1.0=hdf63c60_0
azurerm_virtual_machine.lsml_vm (remote-exec):     - libgfortran-ng==7.3.0=h

azurerm_virtual_machine.lsml_vm (remote-exec):  - readline==8.0=h7b6447c_0
azurerm_virtual_machine.lsml_vm (remote-exec):     - regex==2020.10.15=py38h7b6447c_0
azurerm_virtual_machine.lsml_vm (remote-exec):     - requests==2.24.0=py_0
azurerm_virtual_machine.lsml_vm (remote-exec):     - ripgrep==12.1.1=0
azurerm_virtual_machine.lsml_vm (remote-exec):     - rope==0.18.0=py_0
azurerm_virtual_machine.lsml_vm (remote-exec):     - rtree==0.9.4=py38_1
azurerm_virtual_machine.lsml_vm (remote-exec):     - ruamel_yaml==0.15.87=py38h7b6447c_1
azurerm_virtual_machine.lsml_vm (remote-exec):     - scikit-image==0.17.2=py38hdf5156a_0
azurerm_virtual_machine.lsml_vm (remote-exec):     - scikit-learn==0.23.2=py38h0573a6f_0
azurerm_virtual_machine.lsml_vm (remote-exec):     - scipy==1.5.2=py38h0b6359f_0
azurerm_virtual_machine.lsml_vm (remote-exec):     - seaborn==0.11.0=py_0
azurerm_virtual_machin

azurerm_virtual_machine.lsml_vm (remote-exec):   gstreamer          pkgs/main/linux-64::gstreamer-1.14.0-hb31296c_0
azurerm_virtual_machine.lsml_vm (remote-exec):   h5py               pkgs/main/linux-64::h5py-2.10.0-py38h7918eee_0
azurerm_virtual_machine.lsml_vm (remote-exec):   harfbuzz           pkgs/main/linux-64::harfbuzz-2.4.0-hca77d97_1
azurerm_virtual_machine.lsml_vm (remote-exec):   hdf5               pkgs/main/linux-64::hdf5-1.10.4-hb1b8bf9_0
azurerm_virtual_machine.lsml_vm (remote-exec):   heapdict           pkgs/main/noarch::heapdict-1.0.1-py_0
azurerm_virtual_machine.lsml_vm (remote-exec):   html5lib           pkgs/main/noarch::html5lib-1.1-py_0
azurerm_virtual_machine.lsml_vm (remote-exec):   icu                pkgs/main/linux-64::icu-58.2-he6710b0_3
azurerm_virtual_machine.lsml_vm (remote-exec):   idna               pkgs/main/noarch::idna-2.10-py_0
azurerm_virtual_machine.lsml_vm (remote-exec)

azurerm_virtual_machine.lsml_vm (remote-exec):   python-dateutil    pkgs/main/noarch::python-dateutil-2.8.1-py_0
azurerm_virtual_machine.lsml_vm (remote-exec):   python-jsonrpc-se~ pkgs/main/noarch::python-jsonrpc-server-0.4.0-py_0
azurerm_virtual_machine.lsml_vm (remote-exec):   python-language-s~ pkgs/main/noarch::python-language-server-0.35.1-py_0
azurerm_virtual_machine.lsml_vm (remote-exec):   python-libarchive~ pkgs/main/noarch::python-libarchive-c-2.9-py_0
azurerm_virtual_machine.lsml_vm (remote-exec):   pytz               pkgs/main/noarch::pytz-2020.1-py_0
azurerm_virtual_machine.lsml_vm (remote-exec):   pywavelets         pkgs/main/linux-64::pywavelets-1.1.1-py38h7b6447c_2
azurerm_virtual_machine.lsml_vm (remote-exec):   pyxdg              pkgs/main/noarch::pyxdg-0.27-pyhd3eb1b0_0
azurerm_virtual_machine.lsml_vm (remote-exec):   pyyaml             pkgs/main/linux-64::pyyaml-5.3.1-py38h7b6447c_1
azu

azurerm_virtual_machine.lsml_vm (remote-exec): Preparing transaction: -
azurerm_virtual_machine.lsml_vm: Still creating... [4m0s elapsed]
azure

azurerm_virtual_machine.lsml_vm: Creation complete after 4m26s [id=/subscriptions/3750c13c-7871-4812-9fd5-94fb4ce86681/resourceGroups/lsml-resource-group/providers/Microsoft.Compute/virtualMachines/lsml-machine]
data.azurerm_public_ip.lsml_public_ip: Reading...
data.azurerm_public_ip.lsml_public_ip: Read complete after 1s [id=/subscriptions/3750c13c-7871-4812-9fd5-94fb4ce86681/resourceGroups/lsml-resource-group/providers/Microsoft.Network/publicIPAddresses/lsml-public-ip]

Apply complete! Resources: 9 added, 0 changed, 0 destroyed.

Outputs:

private_ip = "10.0.2.4"
public_domain = "arinkalsmlmachine.westus.cloudapp.azure.com"
public_ip = "23.101.201.118"


## Bash script

In [None]:
%%bash

# locally: copy kaggle.json to the server (environment variables would also work)

# the target directory must exist before scp can copy into it
ssh azureuser@arinkalsmlmachine.westus.cloudapp.azure.com 'mkdir -p ~/.kaggle'
scp .kaggle/kaggle.json azureuser@arinkalsmlmachine.westus.cloudapp.azure.com:/home/azureuser/.kaggle/

# on the server

chmod 600 /home/azureuser/.kaggle/kaggle.json

# install everything that will be needed
pip install kaggle
sudo apt-get update && sudo apt-get install -y unzip p7zip-full
wget https://aka.ms/InstallAzureCLIDeb -O - | sudo bash

# unpack everything and find the top 10
kaggle competitions download -c avito-context-ad-clicks
unzip avito-context-ad-clicks.zip -d clicks
7z e clicks/VisitsStream.tsv.7z -y
# skip the header line, keep the first 1,000,000 records, take the userId column
head -n 1000001 VisitsStream.tsv | tail -n +2 | cut -d$'\t' -f1 > ids.txt
# count visits per user and keep the 10 largest counts
sort ids.txt | uniq -c | sort -k1,1nr | head -n 10 > res.txt

# this is how I logged in to Azure
az login --use-device-code

# upload to blob storage
az storage blob upload --account-name arinkalsmlstorage --container-name content --file res.txt --name c_res.txt
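
Before uploading, the result file can be sanity-checked against the format shown in the assignment's example. This is a minimal sketch, assuming the file is expected to hold exactly 10 lines of the form `<count> <userId>` ordered by count; the helper name `check_top10` is hypothetical.

```shell
#!/usr/bin/env bash
# check_top10 FILE: succeed only if FILE has exactly 10 lines, each consisting
# of two whitespace-separated integers "<count> <userId>", with the counts in
# non-increasing order.
check_top10() {
  awk '
    NF != 2 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ { bad = 1 }  # malformed line
    NR > 1 && $1 + 0 > prev                          { bad = 1 }  # count went up
    { prev = $1 + 0 }
    END { exit (bad || NR != 10) ? 1 : 0 }
  ' "$1"
}
```

A call such as `check_top10 res.txt && echo "res.txt looks fine"` could be placed right before the `az storage blob upload` step.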

## File link

https://arinkalsmlstorage.blob.core.windows.net/content/c_res.txt
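
The link follows the usual public blob URL pattern `https://<account>.blob.core.windows.net/<container>/<blob>`. The sketch below rebuilds it from the names used in `blob.tf` and in the upload command; the helper name `blob_url` is hypothetical, and the download check is left commented out because it needs network access and an existing storage account.

```shell
#!/usr/bin/env bash
# blob_url ACCOUNT CONTAINER BLOB: print the public URL of a blob.
blob_url() {
  printf 'https://%s.blob.core.windows.net/%s/%s\n' "$1" "$2" "$3"
}

blob_url arinkalsmlstorage content c_res.txt

# Anonymous download check (requires network access):
# curl -fsS "$(blob_url arinkalsmlstorage content c_res.txt)"
```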

## Releasing resources

In [11]:
%%writefile vm.tf
# Resources required for the machine are described here

# Public address of the machine
resource "azurerm_public_ip" "lsml_pub_ip" {
  location = azurerm_resource_group.lsml_rg.location
  resource_group_name = azurerm_resource_group.lsml_rg.name

  name = "lsml-public-ip"

  allocation_method = "Dynamic" # Assign a dynamic IP
  idle_timeout_in_minutes = 30
  domain_name_label = "alexlsmlmachine" # A DNS name can be used for convenience
}

# Network interface for our machine
resource "azurerm_network_interface" "lsml_ni" {
  location = azurerm_resource_group.lsml_rg.location
  resource_group_name = azurerm_resource_group.lsml_rg.name

  name = "lsml-nic"

  ip_configuration {
    name = "internal"
    private_ip_address_allocation = "Dynamic"
    subnet_id = azurerm_subnet.lsml_subnet.id
    public_ip_address_id = azurerm_public_ip.lsml_pub_ip.id
  }
}

Overwriting vm.tf


In [12]:
! echo "yes" | terraform apply

azurerm_resource_group.lsml_rg: Refreshing state... [id=/subscriptions/3750c13c-7871-4812-9fd5-94fb4ce86681/resourceGroups/lsml-resource-group]
azurerm_virtual_machine.lsml_vm: Refreshing state... [id=/subscriptions/3750c13c-7871-4812-9fd5-94fb4ce86681/resourceGroups/lsml-resource-group/providers/Microsoft.Compute/virtualMachines/lsml-machine]
azurerm_role_assignment.example: Refreshing state... [id=/subscriptions/3750c13c-7871-4812-9fd5-94fb4ce86681/providers/Microsoft.Authorization/roleAssignments/6a82a025-f6d9-2c9c-e4f5-b98db3c9b0d0]
azurerm_virtual_network.lsml_vn: Refreshing state... [id=/subscriptions/3750c13c-7871-4812-9fd5-94fb4ce86681/resourceGroups/lsml-resource-group/providers/Microsoft.Network/virtualNetworks/lsml-vitrual-network]
azurerm_public_ip.lsml_pub_ip: Refreshing state... [id=/subscriptions/3750c13c-7871-4812-9fd5-94fb4ce86681/resourceGroups/lsml-resource-group/providers/Microsoft.Network/publicIPAddresses/lsm

azurerm_virtual_machine.lsml_vm: Still destroying... [id=/subscriptions/3750c13c-7871-4812-9fd5-...t.Compute/virtualMachines/lsml-machine, 10s elapsed]
azurerm_virtual_machine.lsml_vm: Destruction complete after 15s

Error: Error Creating/Updating Public IP "lsml-public-ip" (Resource Group "lsml-resource-group"): network.PublicIPAddressesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="DnsRecordInUse" Message="DNS record alexlsmlmachine.westus.cloudapp.azure.com is already used by another public IP." Details=[]

  on vm.tf line 4, in resource "azurerm_public_ip" "lsml_pub_ip":
   4: resource "azurerm_public_ip" "lsml_pub_ip" {
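
The `DnsRecordInUse` error is consistent with the rewritten `vm.tf` changing `domain_name_label` from `arinkalsmlmachine` (the label deployed earlier) to `alexlsmlmachine`, which according to the message is already taken by another public IP. The VM itself was still destroyed, as the log shows. A small pre-apply check could catch such a mismatch; this is a minimal sketch, and the helper names `domain_label` and `same_label` are hypothetical.

```shell
#!/usr/bin/env bash
# domain_label FILE: print the value of the first domain_name_label attribute
# found in a Terraform file.
domain_label() {
  sed -n 's/^[[:space:]]*domain_name_label[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$1" \
    | head -n 1
}

# same_label EXPECTED FILE: succeed only if FILE still uses the EXPECTED label.
same_label() {
  [ "$(domain_label "$2")" = "$1" ]
}
```

For example, `same_label arinkalsmlmachine vm.tf || echo "label changed"` would have flagged the edited file before `terraform apply` was run.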
