
    This file contains an analysis of the following:
        1 - Accuracy of siminets (similarity nets).
        2 - Accuracy of database usage.
    Caveat: the analysis uses a dataset created by participants
    of this project, and this dataset is thought to be strongly biased.
    
    First, import the modules:

In [2]:
from analysis_tools import dataset_scraper
from packages.pipes import prefabs

from packages.pipes.collection.feed_disk import FeedFromDiskPipe
from packages.pipes.collection.cleaning import CleaningPipe
from packages.pipes.collection.simi import SimiPipe



    The custom dataset needs to be converted into "tweet" objects which
    have a structure functionally similar to tweepy tweets.
    They are then saved as a list in a pickle file, because this
    is an easy format to handle using the 'pipes' in this project.
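
    As an illustration of the idea above, the following is a minimal
    sketch of tweet-like objects being pickled as a list and loaded
    again. The helper 'make_tweet', the attribute selection (id, text,
    user.screen_name) and the file name are assumptions made for this
    example; the real conversion is done by dataset_scraper and may
    look different.

```python
import pickle
import tempfile
from pathlib import Path
from types import SimpleNamespace


def make_tweet(tweet_id, text, user_id, screen_name):
    """Build a hypothetical tweet-like object with attribute access."""
    user = SimpleNamespace(id=user_id, screen_name=screen_name)
    return SimpleNamespace(id=tweet_id, text=text, user=user)


tweets = [
    make_tweet(1, "A new gallery opened downtown.", 10, "art_fan"),
    make_tweet(2, "The server room is flooded.", 11, "it_admin"),
]

# Save the whole list in one pickle file, then load it back.
path = Path(tempfile.mkdtemp()) / "example.pickle"
with open(path, "wb") as handle:
    pickle.dump(tweets, handle)
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(len(restored), restored[0].user.screen_name)
```

    Storing a plain list keeps loading trivial: one pickle.load call
    returns everything a disk-fed pipe would need to iterate over.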
    

In [3]:
# // Load 'tweet' objects.
TOPICS = ["Art", "Food", "Incident", "IT", "Nature", "Sport"]  # // Existing topics.
PICKLE_FILE = "./analysis_tools/dummyset.pickle"

dataset_pre_pickle = dataset_scraper.fetch_data_all(
    path="./analysis_tools/Data",
    topics=TOPICS
)

# // Check that everything got loaded, should be 242.
print(len(dataset_pre_pickle))

# // Assign unique user id's. They are functionally used as primary keys in the DB.
dataset_scraper.assign_uid(dataset_pre_pickle)

# // Pickle
dataset_scraper.save(
    data=dataset_pre_pickle,
    filename=PICKLE_FILE
)

242



    At this stage, there should be a dataset at the
    path specified above (written during pickling). The next cell
    sets up three pipes to do the following:
         - Fetching the dataset.
         - Converting to dataobjects and cleaning.
         - Adding siminets to the dataobjects.
    

In [3]:
# // Pipe setup
pipe_dsk = FeedFromDiskPipe(
    filepath=PICKLE_FILE,
)
pipe_cln = CleaningPipe(
    previous_pipe=pipe_dsk,
)
pipe_simi = SimiPipe(
    previous_pipe=pipe_cln,
    recursion_level=1
)
pipes = [pipe_dsk, pipe_cln, pipe_simi]


    Now that the pipes are set up, they can loop
    through their content and process everything.
    Note that pipes try to keep their internal
    data count (self.output) below a certain amount.
    The default is 200 (self.threshold_output),
    so the data should be moved somewhere once it is
    processed (the variable 'DATA' below).
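
    To show why the output has to be drained, here is a toy sketch
    of a chained pipe with an output threshold. 'ToyPipe' is a
    hypothetical stand-in written for this example; only the names
    'output', 'threshold_output', 'previous_pipe' and 'process' are
    taken from the text, and the real pipes may behave differently.

```python
class ToyPipe:
    """Hypothetical, simplified pipe; not the project's implementation."""

    def __init__(self, previous_pipe=None, source=None, threshold_output=200):
        self.previous_pipe = previous_pipe
        self.source = list(source or [])  # Only used by the first pipe.
        self.threshold_output = threshold_output
        self.output = []

    def process(self):
        # Do nothing while the buffer is full; a consumer has to drain it.
        if len(self.output) >= self.threshold_output:
            return
        if self.previous_pipe is None:
            if self.source:
                self.output.append(self.source.pop(0))
        elif self.previous_pipe.output:
            self.output.append(self.previous_pipe.output.pop(0))


first = ToyPipe(source=range(10), threshold_output=3)
last = ToyPipe(previous_pipe=first, threshold_output=3)

# Without draining, the last pipe stalls once it reaches its threshold.
for _ in range(20):
    first.process()
    last.process()
stalled_at = len(last.output)

# Draining 'last.output' lets the remaining items flow through.
collected = []
for _ in range(40):
    first.process()
    last.process()
    while last.output:
        collected.append(last.output.pop(0))

print(stalled_at, collected)
```

    In this sketch the last pipe never holds more than three items,
    so all ten items only arrive once something empties its output.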


In [4]:
DATA = [] 
while True:
    for pipe in pipes:
        pipe.process()
    # // Move data to DATA when it is processed.
    if pipe_simi.output: 
        DATA.append(
            pipe_simi.output.pop()
        )
    else: # Check for break here, doing it in while statement won't work.
        break
    print(f"Processed object number: {len(DATA)}", end="\r")


Processed object number: 242


    As all dataobjects have a similarity net at this stage,
    some analysis can be done. The first part of the analysis
    gauges the effectiveness of similarity nets.
    
    This requires some queries which are converted
    into similarity nets. To achieve this, the similarity
    tool inside pipe_simi can be borrowed (to avoid
    loading it again).
    

In [5]:
SIMITOOL = pipe_simi.simitool # // For convenience.

# // Create dict for queries.
QUERY_SIMI = { 
    topic: SIMITOOL.get_similarity_net(
        query=[topic.lower()],
        max_recursion=1
    ) 
        for topic in TOPICS
}
# // Sample a value.
print(QUERY_SIMI.get(TOPICS[0])[0])



['collection', 0.8777222037315369]



    For convenience, all dataobjects in DATA will be sorted
    into a dictionary with the following format:
        'topic': composite_siminet
        

In [6]:
data_dict = {}
for dataobj in DATA:
    name = dataobj.name[1:] # Bug correction; names start with a space
    # // Extend a fresh list per topic, so that no dataobject's own
    # // siminet is mutated while building the composite.
    data_dict.setdefault(name, []).extend(dataobj.siminet)



    At this stage, the actual result calculation will
    be performed. Each siminet of all queries will be
    compared against all siminets of dataobjects and
    the result will be stored in 'RESULT' with the format: 
        what_was_searched:dict(compared_topic:score)
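
    The nested result format can be illustrated with toy data.
    The scoring function below is an assumption made purely for
    this example (sum of weight products for shared words); it is
    not the scoring used by get_score_compressed_siminet, and the
    words and weights are invented.

```python
def toy_score(query_net, data_net):
    """Sum the products of weights for words present in both nets.

    Assumption: illustrates an overlap score only; the project's
    real scoring may work differently.
    """
    data_weights = {}
    for word, weight in data_net:
        # Accumulate, since a composite net may list a word several times.
        data_weights[word] = data_weights.get(word, 0.0) + weight
    return sum(weight * data_weights.get(word, 0.0) for word, weight in query_net)


# Siminets as lists of [word, weight] pairs, keyed by topic.
queries = {
    "Art": [["painting", 1.0], ["collection", 0.5]],
    "Food": [["meal", 1.0]],
}
data = {
    "Art": [["painting", 1.0], ["painting", 0.5], ["gallery", 1.0]],
    "Food": [["meal", 0.5], ["collection", 1.0]],
}

# what_was_searched : dict(compared_topic : score)
result = {
    q_key: {d_key: toy_score(q_net, d_net) for d_key, d_net in data.items()}
    for q_key, q_net in queries.items()
}
print(result)
```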


In [7]:
# // Create dict of results.
RESULT = {}
for q_key in QUERY_SIMI:
    q_simi = QUERY_SIMI.get(q_key)
    
    tmp = {} # // Inner dict: compared_topic -> score
    for d_key in data_dict:
        d_simi = data_dict.get(d_key)
        score = SIMITOOL.get_score_compressed_siminet(
            new = q_simi,
            other = d_simi
        )
        # // Progress printout for convenience.
        print(f"{q_key} vs {d_key} = {score}", end='\r')
        tmp[d_key] = score
    RESULT[q_key] = tmp # // Add inner dict to outer.

Sport vs Sport = 115.8812288045883282363



    At this stage, the results for siminet accuracy
    should be stored in 'RESULT'. The next cell will
    do a printout.
    



In [8]:
# // Calculate padding for printout:
padding = 0
for key in RESULT:
    if len(key) > padding:
        padding = len(key)

# // Printout
for outer_key in RESULT:
    print(f"Query category: '{outer_key}'")
    outer_dict = RESULT.get(outer_key)
    for inner_key in outer_dict:
        whitespace = padding - len(inner_key) + 5
        
        print(f"\t Tweet category '{inner_key}'"
                 f"{whitespace*' '} got score: '{outer_dict.get(inner_key)}'")   


Query category: 'Art'
	 Tweet category 'Art'           got score: '407.8573306798935'
	 Tweet category 'Food'          got score: '0'
	 Tweet category 'Incident'      got score: '0'
	 Tweet category 'IT'            got score: '1.7995017170906067'
	 Tweet category 'Nature'        got score: '12.358308136463165'
	 Tweet category 'Sport'         got score: '1.7833257913589478'
Query category: 'Food'
	 Tweet category 'Art'           got score: '0'
	 Tweet category 'Food'          got score: '148.80612349510193'
	 Tweet category 'Incident'      got score: '0'
	 Tweet category 'IT'            got score: '0'
	 Tweet category 'Nature'        got score: '5.274604618549347'
	 Tweet category 'Sport'         got score: '0'
Query category: 'Incident'
	 Tweet category 'Art'           got score: '0'
	 Tweet category 'Food'          got score: '0'
	 Tweet category 'Incident'      got score: '44.58121180534363'
	 Tweet category 'IT'            got score: '0'
	 Tweet category 'Nature'        got score: 



    -----------
    As all values have been gathered, the share of each query's
    total score that went to its matching tweet category can be
    shown as well (as a fraction between 0 and 1).
    
    


In [9]:

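# // For each query: fraction of the summed score that went to the matching topic.
result_percent = {}
for o_key in RESULT:
    inner_dict = RESULT.get(o_key)
    # // Sum all.
    inner_total = 0
    for i_key in inner_dict:
        inner_total += inner_dict.get(i_key)
    # // Guard against a query that scored 0 against every topic.
    percent = inner_dict.get(o_key) / inner_total if inner_total else 0.0
    # // Save.
    result_percent[o_key] = percent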
    
# // Printout.
for key in result_percent:
    print(f"{key} : {result_percent.get(key)}")
    


Art : 0.962385103034575
Food : 0.9657672657500764
Incident : 1.0
IT : 0.19640876703497692
Nature : 0.7275838338425028
Sport : 0.9165024541421779




    -----------
    Conclusion
    ----------
    After testing with recursion levels 1-2 (inclusive) on dataobject siminets
    and 1-3 (also inclusive) on query siminets, there are a few remarks to be made:
    
    NOTE: All stats were calculated with a float cut-off after second decimal.
    
    The absolute worst performance was with the category 'IT', while
    the absolute best performance was with the category 'Incident'.
    This was a recurring pattern through all tests.
        Absolute worst; 15% with 'IT'
        Absolute best; 100% with 'Incident' (floating point cut-off - actually 99.x)
    
        Average with obj rec 1, query rec 1: 78%
        Average with obj rec 1, query rec 2: 67%
        Average with obj rec 1, query rec 3: 53%
        
        Average with obj rec 2, query rec 1: 69%
        Average with obj rec 2, query rec 2: 60%
        Average with obj rec 2, query rec 3: 45%
        
    
    
    More details:
    
            Art, Food, Incident, IT, Nature, Sport (columns)
    
        rec obj 1, rec query 1
	
            0.96 + 0.96 + 1 + 0.19 + 0.72 + 0.9 = 4.73
            4.73 / 6 = 0.78
    
        rec obj 1, rec query 2
	
            0.82 + 0.89 + 0.94 + 0.15 + 0.38 + 0.87 = 4.05
            4.05 / 6 = 0.67

        rec obj 1, rec query 3

            0.65 + 0.79 + 0.72 + 0.15 + 0.23 + 0.67 = 3.21
            3.21 / 6 = 0.53

        rec obj 2, rec query 1

            0.85 + 0.86 + 0.98 + 0.15 + 0.5 + 0.8 = 4.14
            4.14 / 6 = 0.69

        rec obj 2, rec query 2

            0.75 + 0.76 + 0.86 + 0.16 + 0.36 + 0.74 = 3.63
            3.63 / 6 = 0.60

        rec obj 2, rec query 3

            0.52 + 0.66 + 0.64 + 0.16 + 0.22 + 0.54 = 2.74
            2.74 / 6 = 0.45
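
    The sums and averages above can be re-checked with a short
    script. As a sketch, the scores are stored as integer hundredths
    so that the cut-off after the second decimal becomes a plain floor
    division and no floating point rounding interferes. The dictionary
    layout (keys as (rec obj, rec query) pairs) is an assumption made
    for this example.

```python
# Per-category scores in hundredths, in the column order
# Art, Food, Incident, IT, Nature, Sport.
SCORES = {
    (1, 1): [96, 96, 100, 19, 72, 90],
    (1, 2): [82, 89, 94, 15, 38, 87],
    (1, 3): [65, 79, 72, 15, 23, 67],
    (2, 1): [85, 86, 98, 15, 50, 80],
    (2, 2): [75, 76, 86, 16, 36, 74],
    (2, 3): [52, 66, 64, 16, 22, 54],
}

averages = {}
for (rec_obj, rec_query), values in SCORES.items():
    total = sum(values)
    # Floor division reproduces the cut-off after the second decimal.
    averages[(rec_obj, rec_query)] = total // len(values)
    print(f"rec obj {rec_obj}, rec query {rec_query}: "
          f"sum = {total / 100:.2f}, "
          f"average = {averages[(rec_obj, rec_query)]}%")
```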



  ---------------------------------------------------------------
  ---------------------------------------------------------------
  ---------------------------------------------------------------
  ---------------------------------------------------------------
  ---------------------------------------------------------------
  ---------------------------------------------------------------

            This section is for inserting the dataobjects 
            created above into neo4j. It is important to
            have all credentials in order (credentials file)
            and start the server.

            First, the pickle file will be saved again, with
            fewer objects. This is because it is difficult to
            see what's going on in the database with more than
            200 nodes.


In [10]:

# // Load 'tweet' objects.
TOPICS = ["Incident", "IT"]  # // Existing topics.
PICKLE_FILE = "./analysis_tools/dummyset.pickle"



incident_tweets = dataset_scraper.fetch_data_by_topic(
    path="./analysis_tools/Data",
    topic="Incident"
)

it_tweets = dataset_scraper.fetch_data_by_topic(
    path="./analysis_tools/Data",
    topic="IT"
)


# // Reduce count: keep the second half of each topic.
incident_tweets = incident_tweets[len(incident_tweets) // 2:]
it_tweets = it_tweets[len(it_tweets) // 2:]

# // Check that everything got loaded, should be 20 for each topic.
print(len(incident_tweets))
print(len(it_tweets))

# // Assign unique user id's. They are functionally used as primary keys in the DB.
dataset_scraper.assign_uid(incident_tweets)
dataset_scraper.assign_uid(it_tweets)

# // Combine before save.
combined = []
combined.extend(incident_tweets)
combined.extend(it_tweets)

# // Pickle
dataset_scraper.save(
    data=combined,
    filename=PICKLE_FILE
)


20
20


            The rest is just running the pipe. Note:
                - There's currently an issue with quitting the pipe;
                  it has to crash.
                - There's a default node count limit (25 nodes) when 
                  viewing the database (at localhost:7474). This can
                  be worked around by typing this query into the command field:
                      'MATCH (n) RETURN n LIMIT 100'
                      
                  100 is arbitrary, but if the default code is run
                  in this notebook, 40 should be enough.



In [11]:
pipeline = prefabs.get_pipeline_dsk_cln_simi_db(
    filepath=PICKLE_FILE,
    rec_lvl=1,
    
).run()

FeedFromDiskPipe: 0     CleaningPipe: 0     SimiPipe: 0     DBPipe: 0                                                                               