In [1]:
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'score': [85, 90, 95, 80]
}, index=['a', 'b', 'c', 'd'])

print(df)

      name  score
a    Alice     85
b      Bob     90
c  Charlie     95
d    David     80


In [2]:
df.loc['a']

name     Alice
score       85
Name: a, dtype: object

In [9]:
df.loc[:, 'score']

a    85
b    90
c    95
d    80
Name: score, dtype: int64

In [15]:
df.iloc[:,0]

a      Alice
b        Bob
c    Charlie
d      David
Name: name, dtype: object

What’s the output of this code?
```Python
import numpy as np

arr = np.array([1, 2, 3, 4])
print(arr[arr > 2])
```

`[3 4]`: `arr > 2` builds a boolean mask, and indexing with that mask keeps only the elements where it is True.

## 1. Choose the correct statements about Python (multiple answers OK)
    □ Python uses garbage collection to manage memory
    □ Python variables must be declared with a type
    □ Python functions can have default arguments
    □ Python supports multi-threading without the GIL
    □ Python’s with statement is used for context management

    1,3,5

## 2. What does this code output?
```Python
x = [1, 2, 3]
y = x
x.append(4)
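print(y)
```

[1, 2, 3, 4], because `y = x` binds a second name to the same list object instead of copying it.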

## 3. SQL Logic
You have a table employees with columns:

* id (int)

* name (varchar)

* salary (float)

* department (varchar)

Write a SQL query to return the name of the employee(s) with the highest salary per department.

```SQL
SELECT e.name, e.salary, e.department
FROM employees e
JOIN (
    SELECT department, MAX(salary) AS max_salary
    FROM employees
    GROUP BY department
) m ON e.department = m.department AND e.salary = m.max_salary;

-- Alternative: the same result using a correlated subquery
SELECT e.name, e.salary, e.department
FROM employees e
WHERE e.salary = (
    SELECT MAX(salary)
    FROM employees
    WHERE department = e.department
);
```

## 4. Code a function to flatten a nested list.
Input:

```Python
[[1, 2], [3, 4, [5, 6]], 7]
```

Expected output: [1, 2, 3, 4, 5, 6, 7]

In [67]:
def flatten_it(lst):
    result = []
    for item in lst:
        if isinstance(item, list):
            result.extend(flatten_it(item))
        else:
            result.append(item)
    return result

In [76]:
def flatten_it(lst):
    for item in lst:
        if isinstance(item, list):
            yield from flatten_it(item)
        else:
            yield item

In [78]:
lst = [[1, 2], [3, 4, [5, 6]], 7]

list(flatten_it(lst))

[1, 2, 3, 4, 5, 6, 7]

## 5. What will this code return?
```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'B', 'A', 'B'],
    'value': [10, 20, 30, 40]
})

print(df.groupby('group')['value'].sum())
```

□ A: 40, B: 60
□ A: 20, B: 60
□ A: 30, B: 40
□ Error

A: 40, B: 60 (the first option)

## 6. Write a Bash command to count the number of lines in a large CSV file that contain the word “ERROR” (case-insensitive).

```bash
grep -i "ERROR" file.csv | wc -l
```
Or let `grep` count the matching lines itself, without piping them to `wc`:
```bash
grep -i -c "ERROR" file.csv
```

## 7. Using requests, write a function that retries 3 times if the API call fails (status != 200).

In [None]:
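import requests

def fetch_data(url, max_attempts=3):
    """Return the parsed JSON body, retrying up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            # timeout=10 is an assumed value so a hung connection cannot block forever
            response = requests.get(url, timeout=10)
        except requests.RequestException as exc:
            # Connection errors and timeouts also count as failed attempts
            print(f'Try #{attempt}: request failed ({exc})')
            continue
        if response.status_code == 200:
            return response.json()
        print(f'Try #{attempt}: API call failed with status {response.status_code}')

    raise RuntimeError(f'API call failed after {max_attempts} attempts')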

## 8. What is the difference between INNER JOIN and LEFT JOIN in SQL? Provide an example if needed.

```sql

SELECT *
FROM users
INNER JOIN orders ON users.id = orders.user_id;
```
This returns only users who have placed at least one order (`orders` is a second table that references `users` through `user_id`).

```sql
SELECT *
FROM users
LEFT JOIN orders ON users.id = orders.user_id;
```
This returns every user; for users who have not placed any order, the columns coming from `orders` are NULL.
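
To make the difference concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table contents (Alice with two orders, Bob with none) and the `amount` column are made up for illustration.

```python
import sqlite3

# Hypothetical sample data: Alice has two orders, Bob has none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0);
""")

inner_rows = conn.execute("""
    SELECT users.name, orders.amount
    FROM users
    INNER JOIN orders ON users.id = orders.user_id
    ORDER BY orders.id
""").fetchall()

left_rows = conn.execute("""
    SELECT users.name, orders.amount
    FROM users
    LEFT JOIN orders ON users.id = orders.user_id
    ORDER BY users.id, orders.id
""").fetchall()

print(inner_rows)  # Bob is missing: he has no matching order
print(left_rows)   # Bob appears once, with None (SQL NULL) as amount
conn.close()
```

Note that a user with several orders appears once per order in both results; only the treatment of unmatched users differs.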

## 9. Using Python, how would you find the 5 most common words in a large text file?

In [23]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode, col, lower


spark = SparkSession.builder \
    .appName('read_file') \
    .getOrCreate()


df = spark.read.text('./data/HC-5059_ID06-LVP.txt')


df_words = df.withColumn('words', split(col('value'), ' '))
df_exploded = df_words.select(explode(col('words')).alias('word')) \
                      .withColumn('word', lower(col('word'))) \
                      .filter(col('word') != '')


top_words = df_exploded.groupBy('word') \
    .count() \
    .orderBy(col('count').desc()) \
    .limit(5)

top_words.show()

+----------+-----+
|      word|count|
+----------+-----+
|      been|   88|
|       has|   88|
|   dataset|   87|
|2023-02-27|   34|
|        to|   30|
+----------+-----+
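
If Spark is not available, a plain-Python version is also possible. The sketch below is an alternative to the cell above, not a replacement for it: it reads the file line by line so the whole text never has to fit in memory, and it splits on whitespace only, so punctuation stays attached to words. The function name `top_words` and its parameters are chosen here for illustration.

```python
from collections import Counter

def top_words(path, n=5):
    """Return the n most common words in the file at `path` as (word, count) pairs."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:  # iterate lazily instead of loading the whole file
            counts.update(line.lower().split())
    return counts.most_common(n)
```

Calling `top_words('./data/HC-5059_ID06-LVP.txt')` would then return a list of `(word, count)` tuples, most frequent first.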



## 10. Which of the following are Python libraries for data manipulation and analysis? (multiple OK)
    □ NumPy
    □ Pandas
    □ TensorFlow
    □ SQLAlchemy
    □ OpenCV

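    NumPy, Pandas, SQLAlchemy (SQLAlchemy is a database toolkit and ORM, so it only counts under a broad reading of "data manipulation")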

## 11. Write a Python function that reads a JSON file and returns all keys at the top level.

In [31]:
import json

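def get_json_keys(json_file):
    """Return the top-level keys of a JSON file whose root is an object."""
    with open(json_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
    # A JSON file can also hold a list or a scalar at the top level, which has no keys
    if not isinstance(data, dict):
        raise ValueError('Top-level JSON value is not an object, so it has no keys')
    return list(data.keys())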

## 12. Given the following log line:
```pgsql
2023-09-10 15:24:02, INFO, Job started by user: admin
```
Write a regular expression to extract the date, log level, and user.

In [45]:
import re

def date_log_user(log: str):
    pattern = r"(\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2}, (\w+), .*user: (\w+)"
    match = re.search(pattern, log)
    if match:
        date, level, user = match.groups()
        return [date, level, user]
    else:
        return None

In [46]:
logs = '2023-09-10 15:24:02, INFO, Job started by user: admin'
date_log_user(logs)

['2023-09-10', 'INFO', 'admin']
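
A possible variant uses named groups, so each extracted field carries its own label instead of depending on position. This is only a sketch; the names `LOG_PATTERN` and `parse_log_line` are introduced here for illustration.

```python
import re

# Same pattern as above, with each captured field given a name
LOG_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2}, "
    r"(?P<level>\w+), .*user: (?P<user>\w+)"
)

def parse_log_line(line):
    """Return a dict with 'date', 'level' and 'user', or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    return match.groupdict() if match else None

print(parse_log_line("2023-09-10 15:24:02, INFO, Job started by user: admin"))
# {'date': '2023-09-10', 'level': 'INFO', 'user': 'admin'}
```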

## 13. What’s the difference between batch and stream processing? When would you use one over the other?

In batch processing, data is collected over a period of time and then processed all at once. It suits daily reports and dashboards, where some delay between collection and result is acceptable.

In stream processing, data is processed continuously as it arrives, in real time or close to it. It suits IoT monitoring, fraud detection, and other cases where decisions must be made while the data is still being acquired.
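
The contrast can be sketched in a few lines of plain Python. The function names and the sample readings are hypothetical; real systems would use a scheduler or a streaming framework, but the shape of the computation is the same.

```python
def batch_total(readings):
    """Batch: wait until every reading is collected, then process them in one go."""
    return sum(readings)

def stream_totals(readings):
    """Stream: handle each reading as it arrives and emit an up-to-date result."""
    total = 0
    for value in readings:
        total += value
        yield total

readings = [3, 5, 2]  # hypothetical sensor values
print(batch_total(readings))          # 10: one result after all data is in
print(list(stream_totals(readings)))  # [3, 8, 10]: a result after every event
```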

## 15. What’s the difference between a DataFrame and a Series in pandas?

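A Series is a one-dimensional labeled array: a single column of values with an index.

A DataFrame is a two-dimensional labeled table. It behaves like a dictionary of Series that share one index, with the column names as keys.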
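
A short sketch of that relationship, using made-up values modeled on the frame at the top of this notebook:

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob'], index=['a', 'b'], name='name')
scores = pd.Series([85, 90], index=['a', 'b'], name='score')

# A DataFrame can be built from a dict of Series that share an index
df = pd.DataFrame({'name': names, 'score': scores})

print(scores.ndim, df.ndim)          # 1 2
print(type(df['score']).__name__)    # Series: selecting one column gives a Series back
print(type(df[['score']]).__name__)  # DataFrame: a list of columns keeps the 2D shape
```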

# Software Developer in Data & Machine Learning – Technical Test (LLM Focus)

## 1. Multiple Choice: General Python and ML Concepts
Which of the following statements are TRUE? (multiple answers OK)

    □ A Python function can return multiple values
    □ numpy is often used for data visualization
    □ scikit-learn provides tools for deep learning
    □ transformers is a library by Hugging Face
    □ Tokenization is the process of converting strings into numerical IDs

    1, 4, 5

## 2. What does the following code print?
```python

from transformers import pipeline

qa = pipeline("question-answering")
context = "Python is a popular programming language created by Guido van Rossum."

result = qa(question="Who created Python?", context=context)
print(result["answer"])
```

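Guido van Rossum (the expected answer; the exact span returned depends on the default model that the pipeline downloads)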

## 3. Write a function that takes a text string and returns the most frequent bigram (pair of consecutive words)

In [3]:
from collections import Counter

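def most_frequent_bigram(text):
    """Return the most frequent pair of consecutive words, or None if there are fewer than two words."""
    # Lowercase first so that "The cat" and "the cat" count as the same bigram
    words = text.lower().split()
    bigrams = zip(words, words[1:])
    bigram_counts = Counter(bigrams)
    most_common = bigram_counts.most_common(1)

    return most_common[0][0] if most_common else None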

## 4. What’s the difference between batch size, epoch, and learning rate in training ML models?

Batch size is the number of samples the model processes in one forward and backward pass before it updates the weights.
*   With 1000 training samples and a batch size of 100, the model performs 10 weight updates per pass over the data.
-------
An epoch is one complete pass through all the training samples.
*   In the example above, those 10 updates make up 1 epoch; several epochs are usually needed for good performance.
-------
Learning rate is the coefficient that scales each weight update (the step size along the gradient):
*   a small value leads to smaller weight changes -> slower training (or effectively no progress at all)
*   a large value leads to larger weight updates -> faster training (but it can overshoot the minimum and fail to converge)
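
The three terms can be seen together in a toy training loop. This is a minimal sketch in plain Python under assumed values (a one-parameter model `y = w * x`, 5 epochs, learning rate 0.1); it is not how a real framework implements training.

```python
# Hypothetical toy problem: fit y = w * x with mini-batch gradient descent.
n_samples, batch_size, epochs, learning_rate = 1000, 100, 5, 0.1

xs = [i / n_samples for i in range(n_samples)]  # inputs in [0, 1)
ys = [3.0 * x for x in xs]                      # targets generated with w = 3

w = 0.0
updates = 0
for epoch in range(epochs):                     # one epoch = one pass over all samples
    for start in range(0, n_samples, batch_size):
        batch_x = xs[start:start + batch_size]
        batch_y = ys[start:start + batch_size]
        # Gradient of the mean squared error with respect to w on this batch
        grad = sum(2 * (w * x - y) * x for x, y in zip(batch_x, batch_y)) / len(batch_x)
        w -= learning_rate * grad               # the learning rate scales the step
        updates += 1

print(updates)  # 50 = 5 epochs x 10 updates per epoch
print(w)        # moves from 0 toward the true value 3
```

Changing `batch_size` changes how many updates fit into one epoch, while changing `learning_rate` changes how far each update moves `w`.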