Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ShingleBased.GetProfile is not public #21

Closed
kentcb opened this issue Sep 10, 2020 · 6 comments
Closed

ShingleBased.GetProfile is not public #21

kentcb opened this issue Sep 10, 2020 · 6 comments

Comments

@kentcb
Copy link

kentcb commented Sep 10, 2020

Hi,

Just trying this library out and the README gives this example:

string s1 = "My first string";
string s2 = "My other string...";
        
// Let's work with sequences of 2 characters...
var cosine = new Cosine(2);
        
// For cosine similarity I need the profile of strings
StringProfile profile1 = cosine.GetProfile(s1);
StringProfile profile2 = cosine.GetProfile(s2);
        
// Prints 0.516185
Console.WriteLine(profile1.CosineSimilarity(profile2));

However, this doesn't compile because ShingleBased.GetProfile is protected and because StringProfile no longer appears to be a thing (or maybe that's only in the Java code...?).

Anyway, I can subclass Cosine to hackily gain access to GetProfile, but then I can't really do anything meaningful with it because there is no StringProfile.CosineSimilarity available. I ended up having to copy/paste Cosine in its entirety and hacking in a GetProfile and CosineSimilarity method to gain access to this functionality:

public class Cosine : ShingleBased, INormalizedStringSimilarity, INormalizedStringDistance
{
	public Cosine(int k) : base(k) { }

	public Cosine() { }

	public double Similarity(string s1, string s2)
	{
		if (s1 == null)
		{
			throw new ArgumentNullException(nameof(s1));
		}

		if (s2 == null)
		{
			throw new ArgumentNullException(nameof(s2));
		}

		if (s1.Equals(s2))
		{
			return 1;
		}

		if (s1.Length < k || s2.Length < k)
		{
			return 0;
		}

		var profile1 = GetProfile(s1);
		var profile2 = GetProfile(s2);

		return CosineSimilarity(profile1, profile2);
	}
	
	public double CosineSimilarity(IDictionary<string, int> profile1, IDictionary<string, int> profile2) =>
		DotProduct(profile1, profile2) / (Norm(profile1) * Norm(profile2));

	private static double Norm(IDictionary<string, int> profile)
	{
		double agg = 0;

		foreach (var entry in profile)
		{
			agg += 1.0 * entry.Value * entry.Value;
		}

		return Math.Sqrt(agg);
	}

	private static double DotProduct(IDictionary<string, int> profile1,
		IDictionary<string, int> profile2)
	{
		// Loop over the smallest map
		var small_profile = profile2;
		var large_profile = profile1;

		if (profile1.Count < profile2.Count)
		{
			small_profile = profile1;
			large_profile = profile2;
		}

		double agg = 0;
		foreach (var entry in small_profile)
		{
			if (!large_profile.TryGetValue(entry.Key, out var i)) continue;

			agg += 1.0 * entry.Value * i;
		}

		return agg;
	}

	public double Distance(string s1, string s2)
		=> 1.0 - Similarity(s1, s2);

	public double Similarity(IDictionary<string, int> profile1, IDictionary<string, int> profile2)
		=> DotProduct(profile1, profile2)
		/ (Norm(profile1) * Norm(profile2));
		
	public new IDictionary<string, int> GetProfile(string s) =>
		base.GetProfile(s);
}
@paulirwin
Copy link
Member

Thanks for reporting this. That is likely a peculiarity of the port from Java (the examples were also ported). We'll take a look. Thanks!

@QuirijnLangedijk
Copy link

QuirijnLangedijk commented Feb 19, 2021

I guess that you havent had the time for this yet?

paulirwin added a commit that referenced this issue Feb 19, 2021
…tation and add unit test for Cosine similarity. #21
@paulirwin
Copy link
Member

My apologies for this (and the other PRs here) for slipping through the cracks in the crazy year that was 2020. But there's no need to be sarcastic or rude here. Asking for a status update is fine, trolling is not.

I'm preparing an updated release with this fix. Basically the new pattern will be:

using System;
using F23.StringSimilarity;

public class Program
{
    public static void Main(string[] args)
    {
        string s1 = "My first string";
        string s2 = "My other string...";
        
        // Let's work with sequences of 2 characters...
        var cosine = new Cosine(2);
        
        // For cosine similarity I need the profile of strings
        var profile1 = cosine.GetProfile(s1);
        var profile2 = cosine.GetProfile(s2);
        
        // Prints 0.516185
        Console.WriteLine(cosine.Similarity(profile1, profile2));
    }
}

The removal of StringProfile was correct as GetProfile now returns IDictionary<string, int>, but GetProfile should be public to match the upstream project. The documentation will be updated to reflect that you now call cosine.Similarity(profile1, profile2) and I've added a unit test as well that ensures this behavior.

@paulirwin
Copy link
Member

Hi @kentcb - I just published 4.0.0-beta to NuGet with this change included. Once it's available, can you pull and see if this resolves your issue? Note that this version also updates to .NET Standard 2.0, which drops support for the out-of-support .NET Core 1.x.

@kentcb
Copy link
Author

kentcb commented Mar 3, 2021

@paulirwin thanks very much for fixing this. As it happens, I've moved on from the job where I was using this. Happy for you to close this if you're happy 🙂

@paulirwin
Copy link
Member

v4.0.0 released with this change

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants