<div style="text-align: right">INFO 6105 Data Science Eng Methods and Tools, Lecture 2 Day 1</div>
<div style="text-align: right">Dino Konstantopoulos, 18 January 2024</div>

# Dictionary Problem Homework
Designed to help you learn container types, list comprehensions, and *dictionaries*, the functional data structure that can stand in for objects in OO programming!

We are going to use Python dictionaries to help us learn Chinese and Hindi.

Every time we find an interesting English sentence to translate, we use [Google Translate](https://translate.google.com/) to translate it to Hindi and Chinese, and we store the translations in a dictionary, keyed by the time we enter the data and a random guid.

In [2]:
!pip install bson

In [3]:
from bson.objectid import ObjectId
from datetime import datetime
import random

random.seed(3)

def prefix_crud_timestamp_suffix(key):
    # First three characters are the prefix, the fourth is the crud character
    prefix = key[:3]
    crud = key[3:4]
    hyphen1 = key.find('-')
    # Search for the second hyphen after the first one instead of assuming its position
    hyphen2 = key.find('-', hyphen1 + 1)
    timestamp = key[hyphen1+1:hyphen2]
    suffix = key[hyphen2+1:]
    return prefix, crud, timestamp, suffix #coll, op, time, guid

## seconds since midnight, simulate non-contiguous times
def ssm():
    now = datetime.now()
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return str((now - midnight).seconds + random.randint(0, 1000))

words = dict()
def enter_words(en, zh=None, hi=None):
    # Key format: language-time-guid
    uid = ('zhon-' if zh is not None else 'hind-' if hi is not None else 'oops-') + ssm() + '-' + str(ObjectId())
    words[uid] = (
        dict(english=en, chinese=zh, _id=uid) if zh is not None else
        dict(english=en, hindi=hi, _id=uid) if hi is not None else
        dict(_id=uid)
    )

Here's the structure of our key for an example translation: the first part is the language, the second part is the time (an integer number of seconds), and the third part is a guid (random string).

In [4]:
en = """If a person has not had a chance to acquire his target language by the time he's an adult, 
he's unlikely to be able to reach native speaker level in that language"""
zh = '如果一個人在成人前沒有機會習得目標語言，他對該語言的認識達到母語者程度的機會是相當小的'
('zhon-' if zh != None else 'hind-' if hi != None else 'oops-') + ssm() + '-' + str(ObjectId())

'zhon-80341-65b1d232e1eb5f6cdcf72715'
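
Since the key has three hyphen-separated fields, one way to take it apart is `str.split` with a split limit. This is only a minimal sketch: the helper name `parse_key` is made up for illustration and is not part of the assignment, and it assumes the language tag and the time never contain a hyphen themselves.

```python
def parse_key(key):
    """Split a 'language-time-guid' key into (language, time as int, guid)."""
    # maxsplit=2 stops after the second hyphen, so the guid stays in one piece
    language, timestamp, guid = key.split('-', 2)
    return language, int(timestamp), guid

print(parse_key('zhon-80341-65b1d232e1eb5f6cdcf72715'))
# ('zhon', 80341, '65b1d232e1eb5f6cdcf72715')
```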

We are going to *simulate* the data entering process. I'll give you two files with translations of English sentences, one for Chinese and another for Hindi (from my NLP class):

In [6]:
with open('cmn.txt', 'r', encoding='utf8') as file1:
    lines = file1.readlines()

for l in lines:
    t2 = l.split('\t')
    # [:-1] drops the last character of each field
    enter_words(t2[0][:-1], zh = t2[1][:-1])

In [7]:
with open('hin.txt', 'r', encoding='utf8') as file1:
    lines = file1.readlines()

for l in lines:
    t2 = l.split('\t')
    # [:-1] drops the last character of each field
    enter_words(t2[0][:-1], hi = t2[1][:-1])

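Dictionaries *are not sorted by key*! Since Python 3.7 they remember the order in which items were inserted, but nothing more. That means that if you enumerate dictionary items, the simulated times in the keys will appear *unordered*: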

In [8]:
print([(u,v) for i,(u,v) in enumerate(words.items()) if i < 10 ])

[('zhon-80731-65b1d24de1eb5f6cdcf72716', {'english': 'Hi', 'chinese': '嗨', '_id': 'zhon-80731-65b1d24de1eb5f6cdcf72716'}), ('zhon-80682-65b1d24de1eb5f6cdcf72717', {'english': 'Hi', 'chinese': '你好', '_id': 'zhon-80682-65b1d24de1eb5f6cdcf72717'}), ('zhon-80258-65b1d24de1eb5f6cdcf72718', {'english': 'Run', 'chinese': '你用跑的', '_id': 'zhon-80258-65b1d24de1eb5f6cdcf72718'}), ('zhon-80503-65b1d24de1eb5f6cdcf72719', {'english': 'Wait', 'chinese': '等等', '_id': 'zhon-80503-65b1d24de1eb5f6cdcf72719'}), ('zhon-81062-65b1d24de1eb5f6cdcf7271a', {'english': 'Wait', 'chinese': '等一下', '_id': 'zhon-81062-65b1d24de1eb5f6cdcf7271a'}), ('zhon-80743-65b1d24de1eb5f6cdcf7271b', {'english': 'Hello', 'chinese': '你好', '_id': 'zhon-80743-65b1d24de1eb5f6cdcf7271b'}), ('zhon-80610-65b1d24de1eb5f6cdcf7271c', {'english': 'Dino', 'chinese': '迪诺', '_id': 'zhon-80610-65b1d24de1eb5f6cdcf7271c'}), ('zhon-80765-65b1d24de1eb5f6cdcf7271d', {'english': 'I try', 'chinese': '让我来', '_id': 'zhon-80765-65b1d24de1eb5f6cdcf7271d'}),
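
In the output above, the guid suffixes increase one by one while the simulated times in the middle of each key jump around. A small sketch with made-up keys contrasts iterating a dictionary as-is with iterating it sorted by the time field. The `toy` dictionary and the `time_of` helper are hypothetical names used only for this illustration.

```python
# Three made-up entries, inserted with non-increasing times
toy = {
    'zhon-80731-a': 'Hi',
    'zhon-80258-b': 'Run',
    'zhon-81062-c': 'Wait',
}

def time_of(key):
    """Return the integer time stored in the middle field of the key."""
    return int(key.split('-')[1])

insertion_order = list(toy)            # order in which the keys were added
time_order = sorted(toy, key=time_of)  # order of the simulated times

print(insertion_order)  # ['zhon-80731-a', 'zhon-80258-b', 'zhon-81062-c']
print(time_order)       # ['zhon-80258-b', 'zhon-80731-a', 'zhon-81062-c']
```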

So now I have *one* dictionary for *both* Chinese and Hindi!

Let's separate them into two dictionaries:

In [9]:
separated = dict()
separated['chinese'] = {k:v for k,v in words.items() if k.startswith('zhon')}
separated['hindi'] = {k:v for k,v in words.items() if k.startswith('hind')}

In [10]:
print([(u,v) for i,(u,v) in enumerate(separated['hindi'].items()) if i < 10 ])

[('hind-80377-65b1d252e1eb5f6cdcf77d52', {'english': 'Wow', 'hindi': 'वाह', '_id': 'hind-80377-65b1d252e1eb5f6cdcf77d52'}), ('hind-80736-65b1d252e1eb5f6cdcf77d53', {'english': 'Help', 'hindi': 'बचाओ', '_id': 'hind-80736-65b1d252e1eb5f6cdcf77d53'}), ('hind-80965-65b1d252e1eb5f6cdcf77d54', {'english': 'Jump', 'hindi': 'उछलो', '_id': 'hind-80965-65b1d252e1eb5f6cdcf77d54'}), ('hind-80740-65b1d252e1eb5f6cdcf77d55', {'english': 'Jump', 'hindi': 'कूदो', '_id': 'hind-80740-65b1d252e1eb5f6cdcf77d55'}), ('hind-80328-65b1d252e1eb5f6cdcf77d56', {'english': 'Jump', 'hindi': 'छलांग', '_id': 'hind-80328-65b1d252e1eb5f6cdcf77d56'}), ('hind-80277-65b1d252e1eb5f6cdcf77d57', {'english': 'Hello', 'hindi': 'नमस्ते', '_id': 'hind-80277-65b1d252e1eb5f6cdcf77d57'}), ('hind-80637-65b1d252e1eb5f6cdcf77d58', {'english': 'Hello', 'hindi': 'नमस्कार', '_id': 'hind-80637-65b1d252e1eb5f6cdcf77d58'}), ('hind-80742-65b1d252e1eb5f6cdcf77d59', {'english': 'Cheers', 'hindi': 'वाह-वाह', '_id': 'hind-80742-65b1d252e1eb5f6cd

The key has the format `language-time-randomguid` (we simulated `time` by adding a random number to the number of seconds since midnight). Suppose I want to be able to practice my sentences every day in the (simulated) order that we saved them, and that every day, I want to be able to *see a specific number of sentences with a time greater than a specific time* (entered as an integer: the number of seconds since midnight).

- **Question 1 (50 points)**: Given how many sentences I want to see (variable `n`) and a time of day specified as the number of seconds past midnight (variable `ssm`), write code that yields *the next `n` sentences of both the Chinese and Hindi dictionaries whose time is past `ssm`*. Structure the result as a **dictionary** with two keys: `chinese` and `hindi`.

- **Question 2 (50 points)**: Rewrite your code in Dua Lipa style: in the smallest number of lines of Python (just a few!). Line continuations are allowed. For example:
```
{
    'chinese' : {k:v for k,v in .....},
    'hindi' : {k:v for k,v in .....},
}
```
counts for one line of code.

Time your Dua Lipa code with `%%time`. Shorter times combined with the most beautiful Dua Lipa code get the best grades :-)
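
`%%time` reports a single run, which can be noisy for code that finishes in a few milliseconds. As a complement, the standard-library `timeit` module repeats a callable many times and returns the total elapsed time. The sketch below times a dictionary comprehension on made-up data: the `toy` dictionary, the `filtered` function, and the threshold of 1500 are placeholders, not the homework data.

```python
import timeit

# 1000 made-up entries with times 1000..1999 in the middle field of the key
toy = {f'zhon-{t}-{i:04d}': {'english': str(i)} for i, t in enumerate(range(1000, 2000))}

def filtered():
    """Keep the entries whose time field is greater than 1500."""
    return {k: v for k, v in toy.items() if int(k.split('-')[1]) > 1500}

runs = 100
seconds = timeit.timeit(filtered, number=runs)  # total time for all runs
print(f'{seconds / runs * 1e6:.1f} microseconds per run')
```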

In [15]:
def get_sentences(n, ssm):
    
    #Keep entries whose timestamp (the middle field of _id) is greater than ssm
    chinese_results = {k: v for k, v in separated['chinese'].items() if int(v['_id'].split('-')[1]) > ssm}
    hindi_results = {k: v for k, v in separated['hindi'].items() if int(v['_id'].split('-')[1]) > ssm}
    
    #Select the first n items, in insertion order, from the filtered results
    hindi = dict(list(hindi_results.items())[:n])
    chinese = dict(list(chinese_results.items())[:n])
    
    return {'hindi': hindi,'chinese': chinese}

In [16]:
output = get_sentences(n=10, ssm=10000)
print(output)

{'hindi': {'hind-80377-65b1d252e1eb5f6cdcf77d52': {'english': 'Wow', 'hindi': 'वाह', '_id': 'hind-80377-65b1d252e1eb5f6cdcf77d52'}, 'hind-80736-65b1d252e1eb5f6cdcf77d53': {'english': 'Help', 'hindi': 'बचाओ', '_id': 'hind-80736-65b1d252e1eb5f6cdcf77d53'}, 'hind-80965-65b1d252e1eb5f6cdcf77d54': {'english': 'Jump', 'hindi': 'उछलो', '_id': 'hind-80965-65b1d252e1eb5f6cdcf77d54'}, 'hind-80740-65b1d252e1eb5f6cdcf77d55': {'english': 'Jump', 'hindi': 'कूदो', '_id': 'hind-80740-65b1d252e1eb5f6cdcf77d55'}, 'hind-80328-65b1d252e1eb5f6cdcf77d56': {'english': 'Jump', 'hindi': 'छलांग', '_id': 'hind-80328-65b1d252e1eb5f6cdcf77d56'}, 'hind-80277-65b1d252e1eb5f6cdcf77d57': {'english': 'Hello', 'hindi': 'नमस्ते', '_id': 'hind-80277-65b1d252e1eb5f6cdcf77d57'}, 'hind-80637-65b1d252e1eb5f6cdcf77d58': {'english': 'Hello', 'hindi': 'नमस्कार', '_id': 'hind-80637-65b1d252e1eb5f6cdcf77d58'}, 'hind-80742-65b1d252e1eb5f6cdcf77d59': {'english': 'Cheers', 'hindi': 'वाह-वाह', '_id': 'hind-80742-65b1d252e1eb5f6cdcf77d
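
The result above lists entries in the order they were inserted, not in order of their simulated time. If "the next `n` sentences past `ssm`" is read as the `n` earliest times greater than `ssm`, one possible approach is to sort the matching entries by time before slicing. The sketch below assumes that reading. `next_sentences`, `time_of`, and the toy `separated_demo` data are names invented for illustration.

```python
from itertools import islice

def next_sentences(separated, n, ssm):
    """For each language, return the n entries with the earliest time > ssm."""
    def time_of(key):
        # The time is the middle field of a 'language-time-guid' key
        return int(key.split('-')[1])

    return {
        lang: dict(islice(
            sorted(((k, v) for k, v in entries.items() if time_of(k) > ssm),
                   key=lambda kv: time_of(kv[0])),
            n))
        for lang, entries in separated.items()
    }

# Made-up data with deliberately shuffled times
separated_demo = {
    'chinese': {
        'zhon-300-a': {'english': 'Hi'},
        'zhon-100-b': {'english': 'Run'},
        'zhon-200-c': {'english': 'Wait'},
    },
    'hindi': {
        'hind-250-d': {'english': 'Wow'},
        'hind-150-e': {'english': 'Help'},
    },
}

result = next_sentences(separated_demo, n=2, ssm=120)
print(list(result['chinese']))  # ['zhon-200-c', 'zhon-300-a']
print(list(result['hindi']))    # ['hind-150-e', 'hind-250-d']
```

Sorting the whole filtered set costs more than a plain slice. For large dictionaries and small `n`, `heapq.nsmallest` would be another option to consider.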

In [18]:
%%time

#Using dictionary comprehensions to filter Chinese and Hindi results by the timestamp embedded in each _id
#Note: this expression does not limit the result to n sentences

{
    'chinese': {k: v for k, v in separated['chinese'].items() if int(v['_id'].split('-')[1]) > 10000},
    'hindi': {k: v for k, v in separated['hindi'].items() if int(v['_id'].split('-')[1]) > 10000},
}

CPU times: total: 15.6 ms
Wall time: 13 ms


{'chinese': {'zhon-80731-65b1d24de1eb5f6cdcf72716': {'english': 'Hi',
   'chinese': '嗨',
   '_id': 'zhon-80731-65b1d24de1eb5f6cdcf72716'},
  'zhon-80682-65b1d24de1eb5f6cdcf72717': {'english': 'Hi',
   'chinese': '你好',
   '_id': 'zhon-80682-65b1d24de1eb5f6cdcf72717'},
  'zhon-80258-65b1d24de1eb5f6cdcf72718': {'english': 'Run',
   'chinese': '你用跑的',
   '_id': 'zhon-80258-65b1d24de1eb5f6cdcf72718'},
  'zhon-80503-65b1d24de1eb5f6cdcf72719': {'english': 'Wait',
   'chinese': '等等',
   '_id': 'zhon-80503-65b1d24de1eb5f6cdcf72719'},
  'zhon-81062-65b1d24de1eb5f6cdcf7271a': {'english': 'Wait',
   'chinese': '等一下',
   '_id': 'zhon-81062-65b1d24de1eb5f6cdcf7271a'},
  'zhon-80743-65b1d24de1eb5f6cdcf7271b': {'english': 'Hello',
   'chinese': '你好',
   '_id': 'zhon-80743-65b1d24de1eb5f6cdcf7271b'},
  'zhon-80610-65b1d24de1eb5f6cdcf7271c': {'english': 'Dino',
   'chinese': '迪诺',
   '_id': 'zhon-80610-65b1d24de1eb5f6cdcf7271c'},
  'zhon-80765-65b1d24de1eb5f6cdcf7271d': {'english': 'I try',
   'chinese'